Linux Cubed Series 3: Developer Tools

Text File | 1996-05-05 | 174.6 KB | 5,184 lines

\"Macro for putting levels 1 through 4 section headings in t.o.c. .de $0 .if \\$3=1 \{\ .(x \fB\\$2 \\$1\fR .)x \} .if \\$3=2 \{\ .(x \fB\\$2\fR \\$1 .)x \} .if \\$3=3 \{\ .(x \fB\\$2\fR \\$1 .)x \} .if \\$3=4 \{\ .(x \fB\\$2\fR \\$1 .)x \} .. \" end macro .\" use larger type so that it looks OK after photo-reducing .nr pp 12\" use larger point size .nr sp 12\" yep, I really mean it .nr tp 12\" and I'll mean it after other stuff .nr fp 10\" don't reset to 10 point (and use 9 footnotes) .sz 12\" believe me!!! .\" .\" USING THE EXODUS STORAGE MANAGER .\" .\"Macro the the storage manager version number .ds V "3.1 .po 1.0i .ll 6.5i .fo ''%'' .tp .sp 20 .ls 1 .ce 5 .sz 14 \fBUsing the EXODUS Storage Manager V\*V\fR .sz 12 (Last revision: November, 1993) .sp 3 .(f The Exodus software was developed primarily with funds provided by by the Defense Advanced Research Projects Agency under contracts N00014-85-K-0788, N00014-88-K-0303, and DAABO7-92-C-Q508 and monitored by the US Army Research Laboratory. Additional support was provided by Texas Instruments, Digital Equipment Corporation, and Apple Computer. .)f .bp .sh 1 "INTRODUCTION" .lp The EXODUS Storage Manager is a multi-user object storage system supporting versions, indexes, single-site transactions, distributed transactions, concurrency control, and recovery. This document provides information about using version \*V of the EXODUS Storage Manager. Information about installing the Storage Manager can be found in the \fIEXODUS Storage Manager Installation Manual\fR. Section 2 gives an overview of the system. Section 3 discusses configuration facilities. Section 4 describes, in detail, the Storage Manager's application interface. Section 5 describes how to use the Storage Manager server. Appendices provide more details on certain aspects of the system. A table of contents is located at the end of the document. 
.br
.sh 1 "OVERVIEW OF THE EXODUS STORAGE MANAGER"
.lp
This section, an executive summary, briefly describes the architecture of the Storage Manager and gives an overview of the facilities provided to applications.
.lp
Version \*V of the Storage Manager runs on the following architectures: Sun 4 (Sparc) (under SunOS 4.1.[23]), DecStation 3100/5000 (MIPS) (under Ultrix 4.2), and HP 720 (under HP-UX A.08.07).
The Storage Manager is written in C++ and has been checked for compilation under the GNU C++ compiler (g++), versions 2.3.3 and 2.4.5.
.sh 2 "Architecture"
.lp
The EXODUS Storage Manager has a client-server architecture.
An application program that uses the Storage Manager may reside on a machine different from the machine or machines on which the Storage Manager server or servers run.
.(x z
application
.)x \*($n
We use the term \fIapplication\fR to refer to programs that use the Storage Manager through the client programming interface described in Section 4.
We use the term \fIclient library\fR, or
.(x z
client, client library
.)x \*($n
\fIclient\fR, to refer to the Storage Manager code and data structures that are linked into the application program to support the client programming interface.
The client allows applications to use the facilities described in the next sub-section.
Each client has its own buffer pool for caching data.
The client library connects to one or more server processes and communicates with them using a remote-procedure-call-style mechanism that runs over TCP.
.lp
The Storage Manager server is a multi-threaded process providing asynchronous I/O, file, transaction, concurrency control, and recovery services to multiple clients.
The server stores all data on \fIvolumes\fR, which are either Unix files or raw disk partitions.
The server is more completely described in Section 5 and in the \fIEXODUS Storage Manager Architecture Overview\fR [exoArch].
.br
.sh 2 "Facilities"
.lp
The EXODUS Storage Manager provides \fIobjects\fR for storing data, \fIversions\fR of objects, \fIfiles\fR for grouping related objects, and \fIindexes\fR for supporting efficient object access.
The Storage Manager also provides \fIvolumes\fR, \fItransactions\fR, \fIconcurrency control\fR, \fIrecovery\fR, and \fIconfiguration options\fR.
These facilities are presented briefly in this section, and more information can be found in later sections of the document.
.sh 3 "Objects"
.lp
An object is an uninterpreted container of bytes, which can range in size from a few bytes to a little less than the size of a disk.
Internally, the Storage Manager distinguishes two types of objects.
There are \fIsmall objects\fR, which are objects that fit on a single disk page, and \fIlarge objects\fR, which are objects that do not fit on a single disk page.
Support is also provided for creating and manipulating versions of both small and large objects.
To provide a uniform function call interface, the distinction between small, large, and versioned objects is hidden from applications.
Applications are unaware of whether they are dealing with a small or large object, and the same interface functions are called to manipulate either type of object.
To simplify the task of manipulating very large objects, the Storage Manager provides flexible buffer management that allows variable-length pieces of large objects to be buffered contiguously in the client buffer pool.
.lp
Objects have object identifiers.
The object identifier of a small object points directly to the object on disk, while the object identifier of a large object points to a \fIlarge object header\fR.
The header of a large object serves as the root of a B\*[+\*]tree
.(x z
B\*[+\*]tree index
.)x \*($n
.(x z
index, B\*[+\*]tree
.)x \*($n
index structure that is used to access the object's data [Care86, Care89].
For space efficiency, a large object header can share a disk page with small objects and other large object headers.
The data pages and the pages that make up the index structure of a large object are not shared, however.
When a small object grows to the point where it can no longer be stored on a single page, the Storage Manager automatically converts it to a large object, leaving the new header in place of the original object.
.lp
The Storage Manager provides functions to read, overwrite, insert, delete, and append to an object.
Read requests specify an object identifier and a range of bytes.
The desired data is read into a contiguous region in the client buffer pool (even if it is distributed over several disk pages), and a pointer to the data is returned to the caller.
The overwrite function uses the pointer set up by a read request, and overwrites a subrange of the data.
The insert and delete functions allow data to be inserted into and deleted from objects at arbitrary offsets, while the append function allows data to be appended to the end of an object.
As mentioned earlier, large objects are represented using a B\*[+\*]tree index structure.
This ensures that each of the above operations can be executed efficiently on large objects.
.sh 3 "Versions"
.lp
A version of an object is another object that appears to be a copy of the original object.
A version of a small object is a copy of the original object.
A version of a large object is an object header with a pointer into the original object's data, until either the version or the original object is updated.
When the large object version is updated, the affected portions of the original object are copied to prevent the original object from being affected by the update [Care89].
Although the version support described here is primitive, essentially providing \*(lqcopy-on-write\*(rq objects, it has been purposefully designed that way so that a variety of application-specific versioning schemes can be implemented on top of the Storage Manager.
.sh 3 "Files"
.lp
Objects are allocated in \fIfiles\fR, which are collections of related objects.
Files have three uses.
.lp
First, files are used for clustering objects.
The objects in a file are stored on disk pages allocated solely to that file, so files provide a way to physically co-locate related objects on the disk.
.lp
Second, the Storage Manager provides an efficient way to \fIscan\fR the objects in a file, visiting each object exactly once.
.lp
Third, the Storage Manager offers an efficient mechanism for loading the objects into a file in bulk.
.sh 3 "Indexes"
.lp
The Storage Manager provides B\*[+\*]tree indexes and linear hashing indexes.
.(x z
index, linear hashed
.)x \*($n
.(x z
linear hashed index
.)x \*($n
.(x z
B\*[+\*]tree index
.)x \*($n
.(x z
index, B\*[+\*]tree
.)x \*($n
Index keys can be any basic C language data type or strings.
Values can be any type of fixed length.
.sh 3 "Volumes"
.lp
User data and Storage Manager meta-data (objects, files, indexes, and logs) are stored on volumes.
A volume represents a disk, although in fact it may be a Unix raw disk partition or a Unix file.
.lp
Volumes can be \fItemporary\fR, which means that data stored on them are not logged, and they do not persist from one transaction to the next.
Temporary volumes are meant to provide fast storage for temporary data.
.sh 3 "Transactions"
.lp
A transaction is a set of operations on objects, files, and indexes.
Transactions are either committed or aborted.
Updates made by committed transactions are guaranteed to be reflected on stable storage, even in the event of software or processor failure.
Updates made by aborted transactions are not reflected on stable storage.
.lp
Transactions that use data on more than one server are committed using a distributed two-phase commit protocol [Moha83].
.(x z
two-phase commit protocol
.)x \*($n
.(x z
transactions
.)x \*($n
.(x z
transactions, distributed
.)x \*($n
.(x z
distributed transactions
.)x \*($n
.sh 3 "Concurrency Control"
.lp
Concurrency control allows multiple client applications to use data simultaneously and safely.
Concurrency control is based on the standard hierarchical two-phase locking protocol providing degree-three consistency (see [Gray78, Gray88]).
The lock hierarchy contains two granularities: file-level and page-level.
Locking for index operations is performed with a non-two-phase protocol, which allows multiple clients to read and update the same index.
.lp
Deadlocks involving more than one server are resolved through timeouts.
.sh 3 "Recovery"
.lp
The Storage Manager recovers from software, operating system, and CPU failure by restoring data to a state in which all transactions have been committed or aborted.
After an application fails, the transaction it is running is aborted by the servers that cooperated in the transaction.
After a server fails and is restarted, updates made by committed transactions are restored, and updates by transactions in progress at the time of failure are undone.
Recovery from media (disk) failure is not supported.
.sh 3 "Configuration Options"
.lp
The Storage Manager client library and servers have \fIconfiguration options\fR, which can be set by users.
These options control such things as parameters that affect performance and memory use, formats of volumes and logs, the choice of servers to be contacted by clients, and path names of installed executable files.
.br
.sh 2 "Illustration of Using the Storage Manager"
.lp
The purpose of this section is to give the reader a context in which to read the rest of this document.
This section illustrates a way to get started using the Storage Manager.
There are many ways to install, configure, and use the Storage Manager; only the simplest way is illustrated here.
.lp
This section uses an example application, \*(lqproducer-consumer\*(rq.
The source code for the application programs is included in the Storage Manager software release, along with other example applications.
.lp
The producer program generates a series of transactions, each of which creates an object.
The consumer program generates a series of transactions, each of which reads an object and destroys it.
These programs were selected because they are relatively small, demonstrate the use of transactions, and show how to respond to server-initiated transaction failures and server failures.
.lp
The remainder of this section gives specific directions for starting a server and running the example program.
Detailed explanations of the steps are not given here; all the details are given elsewhere in this document.
.lp
Installing the Storage Manager is akin to installing an operating system or a remote file system (but it's much simpler).
You need to:
.np
install the system's executable code, libraries, and include files;
.np
prepare your disks for use;
.np
configure your server so that it will use your disks, and so that it is otherwise tailored for your use;
.np
compile and link your application programs to use the installed system;
.np
configure your application programs' environment and run the programs; and
.np
when you are finished, shut the system down.
.br
.sh 3 "Files Needed"
.lp
The following files are needed to use the Storage Manager:
.np
\fClibsm_client.a\fR, the Storage Manager client library,
.np
\fCsm_client.h\fR, the include file containing declarations of key data structures and constants,
.np
\fCsm_server\fR, the executable file for the server portion of the Storage Manager,
.np
\fCdiskrw\fR, the executable file for the disk I/O processes used by the server process,
.np
\fCformatvol\fR, a utility program for formatting volumes, and
.np
\&\fC.sm_config\fR, configuration files for a server, the formatter, and the application programs.
One configuration file can be used for all programs, but it is sometimes easier to use one configuration file for servers and the formatter, and another for applications.
.lp
These files can be installed anywhere; for the purpose of this section, we assume that they are all installed in your home directory, along with your application programs.
(See the \fIEXODUS Storage Manager Installation Manual\fR to find the files in the Storage Manager software release.)
.br
.sh 3 "Preparing Your Disks"
.lp
The producer and consumer programs use a volume for storing their objects with a single server, and the server uses a log volume.
The \fCformatvol\fR program is used to format a volume for use as either a data volume or a log volume.
If you plan to use a raw disk partition for either volume, ask your system administrator for information on how to set up the device.
.lp
The formats of the volumes must be described in the configuration file that \fCformatvol\fR reads.
In the directory in which you plan to run \fCformatvol\fR, create a file called \fC.sm_config\fR that looks something like this, with the appropriate substitutions:
.(b
.nf
\fC
formatvol*logformat: /path/to/logfile: 9000: 1: 1: 1000: 8
formatvol*dataformat: /path/to/datafile: 8000: 1: 1: 300
\fR
.)b
.lp
Substitute the pathnames for files that you want to use for your log volume and data volume.
With the options given above, the log volume will be given a volume identifier of 9000, and will consist of 1 cylinder of 1 track, with 1000 blocks on the track; hence, the log will contain 1000 blocks.
The log volume will use 8 Kbyte log pages.
The data volume will be given a volume identifier of 8000, and will consist of 1 cylinder of 1 track, with 300 blocks on the track; hence, the data volume will contain 300 blocks.
.lp
Now, run the formatter on volumes 9000 and 8000:
.(b
\fCformatvol -vol 9000 -vol 8000\fR
.)b
.lp
If you would like to see the information written on the volumes' headers, do this:
.(b
\fCformatvol -dis 9000 -dis 8000\fR
.)b
.lp
The formatter prints:
.(b M
\fCVOLID 9000, version 3, is a LOG volume
BLOCK SIZES: 8 K slotted, 8 K lg data, 8 K lg hdr
             8 K btree, 8 K idesc
LAYOUT: 1000 blk/trk; 1 trk/cyl; 1 cyl
        1000 total blocks of 8 KB for 8192.000 KB
FREE: 0 free, 1000 used
BITMAP: 1 blk each, freemap @ 2, slotmap @ 4, filemap @ 5
UNIQUE: start @ 3
LOG: start @ 7, ctl blk @ 6, blk sz 8 K, #blks 993
     end of log @ dismount: LSN w=0.o=0, LRC w=0.c=1
VOLID 8000, version 3, is a DATA volume
BLOCK SIZES: 8 K slotted, 8 K lg data, 8 K lg hdr
             8 K btree, 8 K idesc
LAYOUT: 300 blk/trk; 1 trk/cyl; 1 cyl
        300 total blocks of 8 KB for 2457.600 KB
FREE: 294 free, 6 used
BITMAP: 1 blk each, freemap @ 2, slotmap @ 4, filemap @ 5
UNIQUE: start @ 3
\fR
.)b
.lp
Now that you have formatted a log volume and a data volume, you are ready to start a server.
.br
.sh 3 "Configuring a Server"
.lp
Before you start a server, you need to create its configuration file.
In the directory in which you expect to run the server, create a file called \fC.sm_config\fR that looks something like this, with the appropriate substitutions (in particular, for each occurrence of \fC/path/to\fR below):
.(b
.nf
\fC
server*bufpages: 500
# Portname need not be identical to log volume id.
# This is just a convenience.
server*portname: 9000
server*diskproc: /path/to/diskrw
server*logformat: /path/to/logfile: 9000: 1: 1: 1000: 8
server*dataformat: /path/to/datafile: 8000: 1: 1: 500
server*logvolume: 9000
\fR
.)b
.lp
If the same configuration file is to be used for the formatter and the server, the format options can be made to be recognized by both:
.(b
\fC
[sf]*[rl].logformat: /path/to/logfile: 9000: 1: 1: 1000: 8
[sf]*[rl].dataformat:/path/to/datafile: 8000: 1: 1: 500
\fR
.)b
.lp
Now you can start the server.
Open a window in which to run the server, and, in the directory containing the server and its configuration file, start the server:
.(b
\fCsm_server\fR
.)b
The server is started on a newly formatted log volume, so it automatically regenerates the log.
The server prints
.(b
\fCServer is ready for requests.\fR
.)b
when it can serve applications.
.br
.sh 3 "Compiling and Linking Your Application"
.lp
An application program must include the header file \fCsm_client.h\fR, which, in turn, includes \fC<stdio.h>\fR, \fC<setjmp.h>\fR, \fC<sys/types.h>\fR, and \fC<netinet/in.h>\fR.
Applications can be compiled with a C or C++ compiler.
.lp
The client library is compiled with C++, so client programs must be linked with a C++ compiler.
See the \fIEXODUS Storage Manager Installation Manual\fR for more information.
.br
.sh 3 "Configuring and Running Your Application"
.lp
The programs need configuration options to determine where to find the server that manages the data volumes they use, and to determine the sizes of the buffer pools they will use.
In the directory in which you expect to run the application programs, create a file called \fC.sm_config\fR that looks something like this, with the appropriate substitutions:
.(b
.nf
\fC
# both producer and consumer will use
# 250 page buffer pools:
client*bufpages: 250
# substitute the name or Internet address
# of the host on which the server runs:
client*mount: 8000 9000@serverhost
\fR
.)b
Now you can run the producer and the consumer.
It is easiest to create a window in which to run each program.
The producer and consumer programs use the environment variable EVOLID to determine what volume to use.
EVOLID must be set in each window.
.lp
In window P:
.(b
\fC# producer <name> <#objects> <object size>
setenv EVOLID 8000
producer P 100 1000\fR
.)b
In window C:
.(b
.sp 1
\fC# consumer <name> <#objects>
setenv EVOLID 8000
consumer C 100\fR
.)b
.lp
The producer creates \*(lq#objects\*(rq objects and writes \*(lqname\*(rq in each one.
The \*(lqobject size\*(rq argument is the size of each object.
The consumer reads and destroys \*(lq#objects\*(rq objects.
It prints the sizes of the objects and their names.
The \*(lqname\*(rq given to the consumer program is immaterial, but is helpful for reading the output when running more than one consumer.
.lp
The two programs use a single root entry and a single file on the given volume.
When a consumer has consumed the last object in a file, it destroys the file and removes the root entry.
Each object is produced or consumed in a separate transaction.
When both a producer and consumer are running concurrently, deadlocks occur periodically, since both are reading and writing the same file.
When a deadlock occurs, the offending program aborts its transaction and tries again.
Multiple producer and consumer programs may be started.
If the server fails or shuts down, the producer and consumer programs attempt to reconnect every five seconds, and when successful, they continue transaction processing.
.br
.sh 3 "Shutting Down the Server"
.lp
In the window in which the server runs, type the command:
.(b
\fCshutdown\fR
.)b
.lp
The server prints various messages, among them
.(b
\fC
Clean shutdown: no recovery required on any volumes.
All disk processes killed.
\fR
.)b
when recovery is not required.
.bp
.sh 1 "CONFIGURATION OPTIONS AND CONFIGURATION FILES"
.lp
The client library, servers, and administrative programs use configuration options.
All the options have a string name, a type, a set of possible values, a default value, and a current value.
Client options can be set by a call to an application interface function or by a line in a \fIconfiguration file\fR.
.(x z
configuration options
.)x \*($n
.(x z
configuration file
.)x \*($n
Server options can be set on the command line or by a line in the server's configuration file.
.lp
Configuration files are Unix files, and are similar in format to the X Window system's resource files.
Each line in a configuration file is an \fIoption command\fR or a \fIcomment\fR.
.lp
A comment is a line that begins with \*(lq#\*(rq or with \*(lq!\*(rq.
.lp
An option command is a line containing an \fIoption descriptor\fR, white space, and a string representing a value to assign to the option.
An option descriptor consists of an \fIoption prefix\fR followed immediately by an option name and a \*(lq:\*(rq.
.lp
The option prefix specifies the type and name of the program or programs for which the option is to be set.
The program type is one of \*(lqclient\*(rq, \*(lqserver\*(rq, and \*(lqformatvol\*(rq.
The program name is usually the file name of the program, without its path (an application program can override this).
The program type and program name are separated by \*(lq.\*(rq.
For example, the complete option descriptor for the option \*(lqbufpages\*(rq on the server named \fCserverA\fR is \fCserver.serverA.bufpages:\fR.
.lp
Wild card characters are allowed in the program type and name.
The character \*(lq*\*(rq represents any portion of the prefix.
The \*(lq?\*(rq character represents any program type or any program name.
The expressions describing the program type and the program name are parsed by a regular expression handler, so complex expressions can be used.
See the manual page for regex(3).
.lp
The names of options can be abbreviated, as long as the abbreviation unambiguously identifies a single option.
(This is also true for options appearing on command lines.)
Program types and names may not be abbreviated.
Option name, program type, and program name matches are case-sensitive.
.lp
Configuration options of type Boolean can be set with the Boolean values TRUE or FALSE, or with the strings \*(lqyes\*(rq, \*(lqtrue\*(rq, \*(lqno\*(rq or \*(lqfalse\*(rq.
The strings may be abbreviated and are not case-sensitive.
.lp
Each setting of an option overrides any previous value for that option.
.lp
Below, excerpts from configuration files illustrate ways to use the options.
.(b I
\fC
# log volumes for two servers, whose executable
# file names are serverA and serverB
server.serverA.logvolume: 1000
server.serverB.logvolume: 2000
\fR
.)b
.(b I
\fC
# turn off progress printing for all servers
server*progress: no
# or
server.?.progress: no
\fR
.)b
.(b I
\fC
! all servers and clients have a 1000 page buffer pool
*bufpages: 1000
# The application foo uses a 500 page buffer pool.
# (overriding the value of 1000, above)
client.foo.bufpages: 500
# Applications beginning with the letter g use 400 pages
client.g*.bufpages: 400
\fR
.)b
.bp
.sh 1 "THE STORAGE MANAGER APPLICATION INTERFACE"
.lp
The Storage Manager's application interface consists of a set of functions, macros, and variables.
The Storage Manager software release contains the header file \fCsm_client.h\fR, in which are found the definitions for the macros and types that appear in this document.
Function prototypes for the Storage Manager functions are also found in \fCsm_client.h\fR.
By convention, words that appear capitalized in the text are either C-preprocessor macros, or C- or C++-defined types.
Function definitions appear in bold face in the text.
The rest of this section is divided into sub-sections describing error handling, initialization and shutdown, transactions, buffer management, operations on objects, operations on versions, operations on files, operations on indexes, miscellaneous macros, and administrative functions.
.br
.sh 2 "Handling Errors"
.lp
Error handling is important to users wishing to write robust client applications.
We discuss it first, since most Storage Manager functions return error codes.
Although this issue is complex, some of the burden is lightened by the recovery facilities of the Storage Manager.
In this section we focus on error codes and error messages.
.lp
Almost all Storage Manager functions have integer return codes.
.(x z
error return codes, sm_errno
.)x \*($n
.(x z
sm_errno
.)x \*($n
All functions (except those used in printing error messages) return either esmNOERROR (zero), which represents success, or esmFAILURE (negative one), which represents an error.
When an error occurs, the global variable sm_errno contains an error code.
A small positive error code is an error code returned by Unix, as defined in \fC<errno.h>\fR.
An error code greater than 65,536 is an error returned by the Storage Manager, as defined in \fCsm_client.h\fR.
The Storage Manager error codes have symbolic names (C preprocessor macros) that begin with \fIesm\fR.
\fBThe value of sm_errno is not defined when the function returns esmNOERROR.\fR
.lp
Information about error codes can be obtained from the functions sm_Error(\ ) and sm_ErrorId(\ ), which are discussed below.
.lp
Some errors cause a message to be printed to the file addressed by \fCsm_ErrorStream\fR.
By default,
.(x z
default error file for messages
.)x \*($n
this file is the standard error file, stderr, as defined in \fC<stdio.h>\fR, but the application can change it at any time after the Storage Manager is initialized.
.lp
Errors differ in severity and have different side effects.
The most severe errors are fatal and cause the application to exit (the client library calls \fIexit(3)\fR).
When the application exits, the servers abort the transaction, if a transaction is active.
Fatal errors are caused by internal software problems in the Storage Manager.
An example of a fatal error is esmMALLOCFAILED, which occurs when the entire data segment has been allocated by the application and client library, and the Storage Manager cannot proceed.
.lp
Less severe errors cause the transaction to be aborted, but leave the application running.
When this happens, sm_errno is given the value esmTRANSABORTED,
.(x z
esmTRANSABORTED
.)x \*($n
.(x z
transaction aborted
.)x \*($n
and the client library also sets the global variable \fCsm_reason\fR.
.(x z
sm_reason
.)x \*($n
.(x z
error return codes, sm_reason
.)x \*($n
The range of values for \fCsm_reason\fR is the same as the range of values for sm_errno.
(The value of \fCsm_reason\fR is meaningful only if sm_errno has the value esmTRANSABORTED, and it is unpredictable and meaningless otherwise.)
When the server or the client library aborts a transaction and returns esmTRANSABORTED to the application, the transaction is only partially aborted.
The application \fBmust\fR complete the termination of the transaction by calling sm_AbortTransaction(\ ) (described in Section 4.3.3, \fBTransaction Operations\fR).
.lp
Still less severe errors are generated by incorrect arguments to client interface functions or by a lack of resources, such as buffer space.
The application can correct the problem and retry the operation without aborting the transaction.
.lp
Finally, some error codes indicate conditions that are not errors at all, such as esmEMPTYFILE, which is returned when an empty file is read.
.lp
The following two functions can be used to print more information about the error.
.sp
.(b L
\fBchar *sm_Error (errorCode)
int errorCode; /* error code returned by an sm function */\fR
.)b
.(b L
\fBchar *sm_ErrorId (errorCode)
int errorCode; /* error code returned by an sm function */\fR
.)b
.lp
These are the only Storage Manager functions that do not return an integer.
When a client library function returns an error, sm_Error(\ ) can be called by the application to get a string that provides a brief description of the error.
It also provides descriptions of Unix error codes.
The function sm_ErrorId(\ ) returns the string representation of the error code.
For example, the call sm_ErrorId(esmBADOID) returns the string \*(lqesmBADOID\*(rq, and the call sm_Error(esmBADOID) returns the string \*(lqinvalid object id.\*(rq
.lp
If the client is disconnected from a server (by a server crash, network failure, etc.), the client library tries to reconnect to the server the next time it issues a request to the server.
If the server in question is not available, the Storage Manager returns an error such as esmSERVERDIED or a Unix error such as ECONNREFUSED.
While the server in question is doing recovery after a restart, esmTRANSDISABLED is returned.
The server responds to requests when recovery is completed.
.br
.sh 2 "Initialization and Shutdown Operations"
.lp
Initialization and shutdown functions are used at the beginning and end of an application program, but most of them can be called at any time.
The pertinent functions are sm_SetClientOption(\ ), sm_GetClientOption(\ ), sm_ParseCommandLine(\ ), sm_ReadConfigFile(\ ), sm_Initialize(\ ), and sm_ShutDown(\ ).
.lp
Before initializing the Storage Manager client with sm_Initialize(\ ), a number of client configuration options must be set by the application.
.(x z
configuration options
.)x \*($n
Options can be set through calls to sm_SetClientOption(\ ), sm_ParseCommandLine(\ ), or sm_ReadConfigFile(\ ).
These options are summarized in Table 1.
See Section 3 for information that applies to all options.
.(b
.TS
box, center, tab(;);
c|c|c|c|c
c|c|c|c|c
l|l|l|l|l.
Option;Option;Possible;Default;Option
Name;Type;Values;Values;Description
_
bufpages;int;> 4;none;# pages in the buffer pool
groups;int;> 3;20;# buffer groups
userdesc;int;> 0;2000;# user descriptors
mount;string;volid port@host;none;where to find server
;;;;for this volume
lognewpages;Boolean;yes,no,true,false;no/false;client logs new pages
deallocpages;Boolean;yes,no,true,false;yes/true;removes empty pages
pagelock;string;SH,EX;SH;default lock for pages
traceflags;int;>= 0;0;set tracing flags
locktimeout;int;>= 0;30;# 10-second intervals
;;;;willing to await a lock
.TE
.ce
.uh "Table 1: Client Options"
.(x z
options, client
.)x \*($n
.)b
.lp
The \*(lqbufpages\*(rq option sets the size of the client buffer pool in 8 Kbyte pages (or \fIn\fR byte pages, for \fIn\fR=MIN_PAGESIZE; MIN_PAGESIZE is defined in \fCsm_client.h\fR).
See Section 4.11.3, \fBTuning the Application\fR for more information about setting this option.
.lp
The \*(lqgroups\*(rq option sets the limit on the number of buffer groups that can be opened at once.
The default value is 20.
See Section 4.6, \fBBuffer Operations\fR, for more information about buffer groups.
.lp
The \*(lquserdescs\*(rq option sets the limit on the number of open user descriptors.
The number of user descriptors should be set to the maximum number of simultaneous object references that are expected by the application program.
The default value is 2000.
See Section 4.7, \fBOperations on Objects\fR, for more information about user descriptors.
.lp
The \*(lqlognewpages\*(rq option, if \*(lqyes\*(rq, causes the client to generate log pages for newly allocated pages, and if \*(lqno\*(rq, causes the server to generate the log pages.
Setting this option to \*(lqno\*(rq results in fewer log records shipped to servers and usually lowers log space requirements for transactions that create objects.
With rare patterns of use, setting \*(lqlognewpages\*(rq to \*(lqyes\*(rq results in better performance: if the objects that cause new pages to be allocated are small, and if enough work is done between object-creation operations to cause the newly allocated pages to be swapped, the preferred value for \*(lqlognewpages\*(rq is \*(lqyes\*(rq. In general, it is difficult to predict which objects will be created on newly allocated pages. The \*(lqlognewpages\*(rq option may be set only when a transaction is not active. .lp The \*(lqdeallocpages\*(rq option, if \*(lqyes\*(rq, causes the client to deallocate pages that become empty after objects are destroyed. If the option's value is \*(lqno\*(rq, these pages remain in the file, and do not get used again unless an appropriate \fInear-hint\fR .(x z near-hint .)x \*($n .(x z hint, near- .)x \*($n is given when an object is subsequently created. Under most circumstances, the preferred value of \*(lqdeallocpages\*(rq is \*(lqyes\*(rq. If objects are created and destroyed in a LIFO fashion, and if the near-hint for object creation is NEAR_LAST, the preferred value is \*(lqno\*(rq. .lp The \*(lqpagelock\*(rq option changes the default lock mode for pages. See Section 4.2, \fBInitialization and Shutdown Operations\fR, and Appendix A, \fBLocking Protocol for Storage Manager Operations\fR, for information about using options. .lp The \*(lqtraceflags\*(rq option is used to turn on tracing, and is only available in a Storage Manager that was compiled with -DDEBUG. The \*(lqtraceflags\*(rq option takes effect immediately and can be set at any time. .lp The \*(lqmount\*(rq options indicate the locations of the volumes that the applications use. The \*(lqmount\*(rq option may be used more than once, to \fIadd\fR new volumes to the client library's set of usable volumes, or to \fIchange\fR the location of a volume. 
The option value consists of a volume's integer identifier, an Internet address, and a port at which a server that manages the volume can be found. The Internet address and port have the format \fIport @ host\fR, where both the port and the host can be numeric or symbolic. Symbolic port names must be found in the services database used by \fIgetservbyname(3n)\fR, and symbolic host names must be in the host name database used by \fIgethostbyname(3n)\fR. The following example shows four values for the \*(lqmount\*(rq option that accomplish the same thing in four ways. The volume 1000 is managed by the server listening on port 1152 (which is called \*(lqbounty\*(rq in the \fC/etc/services\fR database) on the local machine, whose Internet address is 128.105.2.153, also known as \*(lqpitcairn.isle.edu\*(rq to the host-name server. .(l \fC1000 1152@128.105.2.153\fR \fC1000 bounty@pitcairn.isle.edu\fR \fC1000 1152@pitcairn.isle.edu\fR and \fC1000 bounty@128.105.2.153\fR .)l .lp The host name \fIlocalhost\fR \fBdoes not work\fR if you are using distributed transactions (multiple cooperating servers). .lp Volume identifiers \fBmust identify volumes unambiguously, across all servers.\fR .lp For each application or client, \fBall the host names used for a given server must resolve to the same Internet address\fR. Using the above example, this means that \*(lq128.105.2.153\*(rq and \*(lqpitcairn.isle.edu\*(rq are interchangeable. \*(lqLocalhost\*(rq, which resolves to the Internet address 127.0.0.1, is not interchangeable with \*(lq128.105.2.153\*(rq or \*(lqpitcairn.isle.edu\*(rq, even though it addresses the same machine when used by a client on \*(lqpitcairn.isle.edu\*(rq. .lp It is acceptable to use two \fIdifferent\fR servers running on a machine, by addressing them at different \fIports\fR. This means that .(l \fC1000 1151@pitcairn.isle.edu\fR and \fC2000 1152@pitcairn.isle.edu\fR .)l can serve an application. 
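.lp The shape of a \*(lqmount\*(rq value can be illustrated with a small, self-contained C sketch. It is not part of the client library: the helper parse_mount_value(\ ) and the structure mount_value are hypothetical names introduced here. The sketch only splits a value into its volume identifier, port, and host; it does not look up either name in the services or host databases.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the three parts of a "mount" option value. */
struct mount_value {
    int  volid;      /* integer volume identifier */
    char port[64];   /* numeric or symbolic port */
    char host[256];  /* numeric or symbolic host */
};

/*
 * Split a value of the form "volid port@host" into its parts.
 * Blanks around the '@' are tolerated.
 * Returns 0 on success, -1 if the value does not have that shape.
 * Illustration only; this is not the library's own parser.
 */
int parse_mount_value(const char *value, struct mount_value *out)
{
    char extra;

    if (value == NULL || out == NULL)
        return -1;
    /* Exactly three fields must convert; a fourth means trailing junk. */
    if (sscanf(value, "%d %63[^@ ] @ %255s %c",
               &out->volid, out->port, out->host, &extra) != 3)
        return -1;
    return 0;
}
```

Under these assumptions, the value \fC1000 bounty@pitcairn.isle.edu\fR yields volume 1000, port \*(lqbounty\*(rq, and host \*(lqpitcairn.isle.edu\*(rq, while a value with no \*(lq@\*(rq is rejected.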
.lp The \*(lqlocktimeout\*(rq option limits the time the server waits to acquire a lock on behalf of the client. The value represents a number of 10-second intervals. A value of zero means that the server does not wait at all, and if the lock cannot be acquired immediately, the client operation returns esmFAILURE, with esmLOCKBUSY in sm_errno. The option value can be changed at any time. The value that is in effect at the time a transaction makes its first request to a server is the value used for lock requests on that server for the duration of the transaction. See Appendix A, Section A.3, \fBDeadlock Detection and Avoidance\fR, for more information about locks. See also Section 4.4, \fBMounting and Dismounting Volumes\fR, for information concerning the protocol between clients and servers. .lp To support code that was written before the configuration option .(x z configuration options .)x \*($n facility was added, the client library looks for the environment variable ESMCONFIG. If set, ESMCONFIG indicates a configuration file to read. .(x z configuration file .)x \*($n The file is read using sm_ReadConfigFile(\ ), with its \*(lqprogramName\*(rq argument having the value NULL. It is read before any option is set, so all other functions that set options override those found in the ESMCONFIG file. .sp .(b L \fBsm_SetClientOption (optionName, optionValue, valueType) char *optionName; /* IN name of the option to set */ void *optionValue; /* IN new value for the option */ SMDATATYPE valueType; /* IN type of optionValue */\fR .)b .(x z sm_SetClientOption .)x \*($n Sm_SetClientOption(\ ) sets the option named \*(lqoptionName\*(rq to the value in \*(lqoptionValue\*(rq. The \*(lqvalueType\*(rq argument indicates the type addressed by \*(lqoptionValue\*(rq. The supported types are SM_int, SM_Boolean, and SM_string. If \*(lqvalueType\*(rq matches the type of the option as specified in Table 1, a simple assignment is done. 
If \*(lqvalueType\*(rq is SM_string and the option has a different type, a conversion is performed. .sp .(b L \fBsm_GetClientOption (optionName, optionValue) char *optionName; /* IN name of the option to get */ void *optionValue; /* OUT value for the option */\fR .)b .(x z sm_GetClientOption .)x \*($n Sm_GetClientOption(\ ) retrieves the value for \*(lqoptionName\*(rq and returns it in \*(lqoptionValue\*(rq. It is assumed that the location addressed by \*(lqoptionValue\*(rq matches the type, found in Table 1, for the option. For string-type options, the argument \*(lqoptionValue\*(rq is treated as type \*(lqconst char **\*(rq. That is, it should contain the address of a pointer variable that is updated to point to a read-only buffer containing the option value. .sp .(b L \fBsm_ParseCommandLine (argc, argv, errorMsg) int *argc; /* IN/OUT number of command line arguments */ char **argv; /* IN/OUT command line arguments */ char **errorMsg; /* OUT syntax error message */\fR .)b .(x z sm_ParseCommandLine .)x \*($n Sm_ParseCommandLine(\ ) searches the command line, \*(lqargv\*(rq, for any client options. Command-line options are prefixed by a \*(lq-\*(rq. The value for the option must follow the option name. The Storage Manager ignores any command-line argument that is not recognized as a Storage Manager client option. If a client option is found, the name and value are removed from \*(lqargv\*(rq and \*(lqargc\*(rq is decremented by 2, even if there is an error in the option such as being given an illegal value. If there is an error processing any option, \*(lqerrorMsg\*(rq is changed to point to an error message string. 
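.lp The way recognized options are removed from \*(lqargv\*(rq can be sketched in plain C. This is an illustration of the technique, not the library's code: strip_client_options(\ ) and is_client_option(\ ) are hypothetical names, the option list is copied from Table 1, and the sketch assumes that \*(lqargv\*(rq has room for a terminating NULL at index \*(lqargc\*(rq, as the \*(lqargv\*(rq passed to main(\ ) does. What the real function does with an option that has no value after it is not stated in this document; the sketch simply leaves such an argument in place.

```c
#include <string.h>

/* Option names as listed in Table 1 (reproduced for illustration). */
static const char *const client_options[] = {
    "bufpages", "groups", "userdesc", "mount", "lognewpages",
    "deallocpages", "pagelock", "traceflags", "locktimeout"
};

/* Return 1 if arg is "-" followed by a name from the list above. */
static int is_client_option(const char *arg)
{
    size_t i;

    if (arg[0] != '-')
        return 0;
    for (i = 0; i < sizeof client_options / sizeof client_options[0]; i++)
        if (strcmp(arg + 1, client_options[i]) == 0)
            return 1;
    return 0;
}

/*
 * Remove every recognized "-name value" pair from argv, closing the
 * gaps so the remaining arguments keep their order. *argc drops by 2
 * for each pair removed. Returns the number of pairs removed.
 */
int strip_client_options(int *argc, char **argv)
{
    int in = 1, out = 1, removed = 0;

    while (in < *argc) {
        if (is_client_option(argv[in]) && in + 1 < *argc) {
            in += 2;                    /* skip the name and its value */
            removed++;
        } else {
            argv[out++] = argv[in++];   /* keep an unrecognized argument */
        }
    }
    argv[out] = NULL;                   /* assumes argv has argc + 1 slots */
    *argc = out;
    return removed;
}
```

After such a pass, the application can parse its own arguments without having to know which ones belonged to the Storage Manager.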
.sp .(b L \fBsm_ReadConfigFile (configFile, programName, errorMsg) char *configFile; /* IN name of the configuration file */ char *programName; /* IN name of the application */ char **errorMsg; /* OUT syntax error message */\fR .)b .lp Sm_ReadConfigFile(\ ) reads the option configuration file .(x z sm_ReadConfigFile .)x \*($n \*(lqconfigFile\*(rq, and sets the options indicated. If \*(lqconfigFile\*(rq is NULL, the default configuration files \fC/usr/lib/exodus/sm_config\fR, \fC$HOME/.sm_config\fR, and \fC./.sm_config\fR are read in that order, if they exist. The name of the default configuration file \fC/usr/lib/exodus/sm_config\fR .(x z default configuration files .)x \*($n .(x z configuration files, default .)x \*($n can be changed with a minor Storage Manager source code change described in the installation manual, \fIEXODUS Storage Manager Installation Manual\fR. The \*(lqprogramName\*(rq argument gives the program name for matching with options in the configuration file. If \*(lqprogramName\*(rq is NULL and a previous call to sm_ReadConfigFile(\ ) had a non-NULL \*(lqprogramName\*(rq, the previous \*(lqprogramName\*(rq is used. If no previous call was made and a \*(lqprogramName\*(rq is not given, configuration file lines that contain a program name are not used; only generic entries, such as \fCclient.bufpages: 1000\fR and \fCclient*bufpages: 1000\fR, are used. .lp When an error occurs while reading the file, an error message is stored in \*(lqerrorMsg\*(rq and esmFAILURE is returned, as with other Storage Manager functions. The \*(lqerrorMsg\*(rq string describes syntax-related errors in the configuration file. .lp See Section 3 for information about the format of configuration files. .sp .(b L \fBsm_Initialize (\ )\fR .)b .(x z sm_Initialize(\ ) .)x \*($n Sm_Initialize(\ ) initializes the Storage Manager's data structures. 
No Storage Manager functions except option and configuration file functions may be called .(x z configuration file .)x \*($n before sm_Initialize(\ ) is called. Options that do not have defaults must be set before sm_Initialize(\ ) is called; otherwise esmFAILURE is returned and sm_errno is set to indicate the problem. .sp .(b L \fBsm_ShutDown (\ )\fR .)b Sm_ShutDown(\ ) .(x z sm_ShutDown(\ ) .)x \*($n closes all the open buffer groups and frees the memory allocated at run-time by the client library. Once the client library has been shut down, it can be used again by calling sm_Initialize(\ ). The client library loses track of the information in the \*(lqmount\*(rq client options, so if sm_Initialize(\ ) is to be used again, the configuration files must be reread or the mount options must be reset with sm_SetClientOption(\ ). .lp Figure 2 shows a simple \*(lqhello world\*(rq application for the Storage Manager. It sets configuration options, initializes the client library, .(x z configuration options .)x \*($n and shuts down the client library. A more complete program would begin transactions and perform operations on objects, files, and indexes. More sample programs are included with the software release. .(z I .sz -3 \fC/* * "Hello world" program: demonstrates initialization and shutdown. 
*/ #include <stdio.h> #include <stdlib.h> #include "sm_client.h" void ErrorCheck (int, char *); int main(int argc, char** argv) { int e; char *errorMsg; e = sm_ReadConfigFile(NULL, argv[0], &errorMsg); if (e != esmNOERROR) { fprintf(stderr, "Configuration file error: %s\en", errorMsg); ErrorCheck(e, "sm_ReadConfigFile"); exit(1); } e = sm_ParseCommandLine(&argc, argv, &errorMsg); if (e != esmNOERROR) { fprintf(stderr, "Command line error: %s\en", errorMsg); ErrorCheck(e, "sm_ParseCommandLine"); exit(1); } e = sm_Initialize(\ ); ErrorCheck(e, "sm_Initialize"); printf("Hello world!\en"); e = sm_ShutDown(\ ); ErrorCheck(e, "sm_ShutDown"); return 0; } /* Report the Storage Manager error and exit if e indicates failure. */ void ErrorCheck (int e, char *func) { if (e < 0) { fprintf(stderr, "Storage Manager error \e"%s\e" in %s\en", sm_Error(sm_errno), func); exit(1); } }\fR .sz +3 .ce .uh "Figure 2: Example Program" .)z .br .sh 2 "Transactions" .lp The Storage Manager supports transactions, including concurrency control and recovery. Transactions may involve data managed by several Exodus Storage Manager servers, in which case a two-phase commit protocol, based on .(x z presumed abort .)x \*($n Presumed Abort [Moha83], determines the fate of the transaction when the application commits the transaction. The fact that such a transaction is distributed over several servers .(x z transactions, distributed .)x \*($n .(x z distributed transactions .)x \*($n is invisible to the application. On the other hand, the Storage Manager (server or servers) can cooperate in a two-phase commit procedure with other transaction processing systems when the external two-phase commit functions are used. The external two-phase commit functions also can be used explicitly to invoke the two phases for a transaction that involves only Exodus Storage Manager servers. 
The external two-phase commit functions are described under \*(lqAdvanced Topics\*(rq, in Section 4.11.3, \fBExternal Two-Phase Commit Functions\fR. .lp Object, file, index, and root entry operations must be performed within the scope of a transaction, or an error is returned. An application can run no more than one transaction at a time. Transactions cannot be nested, suspended, or resumed. .lp In order to guarantee the semantics of transactions, operations on objects and files acquire \fIlocks\fR. .(x z locks .)x \*($n Appendix A describes the kinds of locks acquired by the client library functions. .br .sh 3 "Transaction Identifiers" .lp Each transaction has a local transaction identifier, which is assigned by the Storage Manager. The data type TID represents a transaction identifier. .(x z transaction identifier .)x \*($n .(x z transaction identifier, local .)x \*($n The application can treat a TID as an opaque value. The Storage Manager maintains a global variable, Tid, of type TID, whose value the application can inspect but should not modify. .lp The application can use the following two macros to give an initial value to a transaction identifier, and to recognize that value. .(b I \fBINVALIDATE_TID (TID tid)\fR .)b .lp sets the \*(lqtid\*(rq argument to an invalid transaction identifier. .(b I \fBTID_IS_INVALID (TID tid)\fR .)b .lp returns TRUE if \*(lqtid\*(rq is the value given by INVALIDATE_TID(\ ), FALSE if not. TID_IS_INVALID(\ ) does not tell if there is an active transaction with the given transaction identifier. .br .sh 3 "Transaction States" .lp An application is always in one of the following states: not running a transaction (INACTIVE), running a transaction (ACTIVE), or running a transaction that has been (partially) aborted (ABORTED). .lp An application is in the INACTIVE state until it calls sm_BeginTransaction(\ ), and after a call to sm_CommitTransaction(\ ) or sm_AbortTransaction(\ ). 
.lp If the Storage Manager server or client library aborts a transaction, which sometimes happens because of an error on the part of the application, the application is in the ABORTED state until a call to sm_AbortTransaction(\ ). While in the ABORTED state, a call to any function other than sm_AbortTransaction(\ ) returns the .(x z esmTRANSABORTED .)x \*($n error esmTRANSABORTED. .br .sh 3 "Transaction Operations" .sp .(b L \fBsm_BeginTransaction (tid) TID *tid; /* OUT transaction ID */\fR .)b .(x z sm_BeginTransaction(\ ) .)x \*($n Sm_BeginTransaction(\ ) is called at the beginning of a transaction. The argument \*(lqtid\*(rq corresponds to a transaction identifier and is assigned by the Storage Manager. .lp Sm_BeginTransaction(\ ) \fBdoes not\fR contact any servers or initiate a transaction with any server, since the operation has no arguments to indicate which servers are of interest. It only begins a transaction \*(lqlocally\*(rq. Once a transaction has begun locally, the client library initiates transactions on servers when data references so require. .sp .(b L \fBsm_CommitTransaction (tid) TID tid; /* IN transaction ID */\fR .)b .(x z sm_CommitTransaction(\ ) .)x \*($n Sm_CommitTransaction(\ ) is called to commit the effects of a transaction. If the commit succeeds, all changes made to data since the beginning of the transaction are guaranteed to be persistent, even in the event of system failure. See Section 4.9.1, \fBConsistency Guarantees for Files\fR, for more information about this guarantee. If the commit fails, an error is returned, and the transaction is aborted. When a transaction is committed, all user descriptors (see sm_ReadObject(\ ) ) are released. Buffer groups attached to the transaction (see sm_OpenBufferGroup(\ ) ) are closed. .sp .(b L \fBsm_AbortTransaction (tid) TID tid; /* IN transaction ID */\fR .)b .(x z sm_AbortTransaction(\ ) .)x \*($n Sm_AbortTransaction(\ ) aborts a transaction. 
Sm_AbortTransaction(\ ) releases all the user descriptors that were created during the transaction (see sm_ReadObject(\ ) ). Buffer groups attached to the transaction (see sm_OpenBufferGroup(\ ) ) are closed. .lp The persistent data appear as if the transaction never began. The execution state of the application program is not affected by calling sm_AbortTransaction(\ ). The result is that the transient data in the program's address space do not match the state of the persistent data. The problem can be alleviated to some degree by judicious use of \fIsetjmp(3)\fR, \fIlongjmp(3)\fR, and lexical scoping in the application program. The following macros, which are defined in \fCsm_client.h\fR, do that: .sp .(b L \fBSM_BEGIN_TRANSACTION (tid, abortCode) TID *tid; /* transaction ID */ int abortCode; /* location to store abort code */\fR .)b SM_BEGIN_TRANSACTION begins a transaction block (i.e. it opens a new lexical scope in C or C++). The transaction ID is placed in \*(lqtid\*(rq. The argument \*(lqabortCode\*(rq \fBmust\fR be a variable. This variable can be checked at the end of the transaction to determine whether it was aborted. .sp .(b L \fBSM_COMMIT_TRANSACTION (tid) TID tid; /* transaction ID */\fR .)b SM_COMMIT_TRANSACTION ends a transaction block. When this statement is executed, the transaction is committed, assuming no error occurs during commit. Immediately after the SM_COMMIT_TRANSACTION statement, the \*(lqabortCode\*(rq variable given in the SM_BEGIN_TRANSACTION statement should be checked to see if any error occurred. If no error occurred, \*(lqabortCode\*(rq is set to esmNOERROR. Otherwise, \*(lqabortCode\*(rq is set to the value given in SM_ABORT_TRANSACTION. .sp .(b L \fBSM_ABORT_TRANSACTION (abortCode) int abortCode; /* error to return on abort */\fR .)b SM_ABORT_TRANSACTION aborts the active transaction (i.e. 
sm_AbortTransaction(\ ) is called) and resumes execution at the line immediately following the SM_COMMIT_TRANSACTION statement for the transaction. The SM_ABORT_TRANSACTION macro does not need to be called within the lexical scope of the transaction block. It can be called in any function operating in the dynamic scope of the transaction. The \*(lqabortCode\*(rq argument sets the \*(lqabortCode\*(rq variable given in SM_BEGIN_TRANSACTION. .lp When SM_ABORT_TRANSACTION is called, the program's control is transferred to the program point after the SM_COMMIT_TRANSACTION statement. The stack pointer is restored to the level of the transaction block, so functions that were called from within the block and are still on the program's stack are not completed. \fBFor C++, this means that destructors are not called for any local variables in those functions.\fR .lp Examples of using both the transaction macros and functions can be found in the producer-consumer example given in the Storage Manager software release. .br .sh 2 "Mounting and Dismounting Volumes" .lp An application program \fBdoes not\fR need to mount and dismount volumes explicitly. In most cases, the client library automatically mounts a volume when the application makes its first reference to that volume. An application that does not explicitly mount a volume may, when it performs its first operation on an object, find that the server for that object is not running. Writing programs to handle such common errors can be difficult, so it may be more convenient to mount volumes before proceeding with operations on data. Sm_MountVolume(\ ) serves that purpose. If that server has not yet been contacted, sm_MountVolume(\ ) establishes a connection to the server and mounts the volume. It does not begin a transaction. (See Section 4.3.3, \fBTransaction Operations\fR, to understand how transactions are begun.) .lp When an application exits or calls sm_ShutDown(\ ), connections to servers are severed, and the servers dismount the volumes used by the application. 
A server severs its connections and dismounts the volumes if an application is \fIinactive\fR for a significant time. An application is inactive if it has no transaction running. .lp An application can dismount volumes explicitly, causing the volumes to be dismounted at the server. An application that continues to run after it is finished using the Storage Manager should call sm_ShutDown(\ ). If it is inappropriate to use sm_ShutDown(\ ), but such an application is finished with a set of volumes, it should dismount the volumes, particularly if the volumes are likely to be reformatted. .sp .(b L \fBsm_MountVolume ( volid ) VOLID volid; /* IN volume to mount */\fR .)b .(x z sm_MountVolume(\ ) .)x \*($n .lp Sm_MountVolume(\ ) causes the volume identified by \*(lqvolid\*(rq to be mounted. A side effect of the operation is that the client library establishes a connection with the server that manages this volume. .lp If the volume cannot be mounted, sm_MountVolume(\ ) returns esmFAILURE and a value in sm_errno that describes the reason: esmNOSUCHVOLUME (the client library cannot identify the server for this volume because there is no \*(lqmount\*(rq option for this volid), esmTRANSABORTED (the transaction was aborted during the previous operation, and the next thing the application must do is abort the transaction), esmSERVERDIED (connection with server was severed during the mount operation), or any Unix error code from \fC<errno.h>\fR (such as ENETDOWN and ECONNREFUSED), which indicates that the server is not running or is unreachable through the network. .sp .(b L \fBsm_DismountVolume ( volid ) VOLID volid; /* IN volume to dismount */\fR .)b .(x z sm_DismountVolume(\ ) .)x \*($n .lp The \*(lqvolid\*(rq argument identifies the volume to be dismounted. If the volume is not mounted, the operation returns esmFAILURE, and the client library returns esmBADVOLID in sm_errno. 
.sh 2 "Root Entries" .lp The root entry facility is designed for applications to get a handle to data on a volume. \** .(f \** Root entries cannot be created on temporary volumes. .)f A common use of a root entry is to associate a string name with an object identifier for an object containing information about the contents of the volume. For example, in a database system, this might be the object identifier for the catalog. .lp A root entry is a string and data pair stored in a special location on a volume, called the root area. The string, called the name, is used to identify the entry. The name string must be null-terminated. The maximum lengths of the name (including the terminating null) and data are defined by MAX_ROOTNAME_SIZE and MAX_ROOTDATA_SIZE respectively. An error is returned if the available number of root entries is exceeded. Names and data are limited to 32 bytes each, and approximately 90 root entries can reside in a volume's root area. .sp .(b L \fBsm_SetRootEntry (volid, name, data, dataLength) VOLID volid; /* IN volume identifier */ char *name; /* IN name to store data entry under */ void *data; /* IN data entry to be stored */ int dataLength; /* IN length of the data */\fR .)b .(x z sm_SetRootEntry(\ ) .)x \*($n .lp Sm_SetRootEntry(\ ) creates or updates an entry. The \*(lqname\*(rq argument is the name of the entry and the \*(lqdata\*(rq argument is the data to be stored. The number of bytes in the data is given in \*(lqdataLength\*(rq. For example, to store the contents of the variable \*(lqrootOid\*(rq under the name \*(lqroot-obj\*(rq, use \fCsm_SetRootEntry(volid, \*(lqroot-obj\*(rq, (char*) &rootOid, sizeof(rootOid))\fR. .lp Sm_SetRootEntry(\ ) obtains an exclusive .(x z lock, exclusive .)x \*($n lock on the root area of the volume, so updates to root entries should be performed in a short transaction. 
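.lp Because the exclusive lock makes a failed update costly, an application may want to check the size limits before it starts the transaction. The following is a minimal sketch of such a check. The helper root_entry_args_ok(\ ) is a hypothetical name, and the two constants are defined here only as stand-ins, using the 32-byte limit quoted above, so that the sketch compiles without \fCsm_client.h\fR; a real program would take them from that header.

```c
#include <stddef.h>
#include <string.h>

/* Stand-ins for the constants in sm_client.h (32 is the limit in the text). */
#ifndef MAX_ROOTNAME_SIZE
#define MAX_ROOTNAME_SIZE 32
#endif
#ifndef MAX_ROOTDATA_SIZE
#define MAX_ROOTDATA_SIZE 32
#endif

/*
 * Hypothetical helper: return 1 if name and dataLength fit within the
 * root entry limits, 0 otherwise. It does not contact any server.
 */
int root_entry_args_ok(const char *name, int dataLength)
{
    if (name == NULL)
        return 0;
    if (dataLength < 0 || dataLength > MAX_ROOTDATA_SIZE)
        return 0;
    /* The name limit includes the terminating null. */
    if (strlen(name) + 1 > MAX_ROOTNAME_SIZE)
        return 0;
    return 1;
}
```

With these stand-in values, a name of 31 characters is accepted and a name of 32 characters is rejected, because the terminating null counts toward the limit.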
.sp .(b L \fBsm_GetRootEntry (volid, name, data, dataLength) VOLID volid; /* IN volume identifier */ char *name; /* IN name of the entry */ void *data; /* OUT data stored under name */ int *dataLength; /* IN/OUT length of the data */ .)b .(x z sm_GetRootEntry(\ ) .)x \*($n Sm_GetRootEntry(\ ) retrieves the root entry named \*(lqname\*(rq. The data is placed in \*(lqdata\*(rq and the length of the data is returned in \*(lqdataLength\*(rq. If \*(lqdataLength\*(rq is initialized with a value greater than or equal to zero, the maximum number of bytes copied to \*(lqdata\*(rq is \*(lqdataLength\*(rq. If \*(lqdataLength\*(rq is initialized with a value less than zero, the entire length of the data is copied to \*(lqdata\*(rq. .lp Sm_GetRootEntry(\ ) obtains a share lock on the root area of the volume. This share lock blocks other .(x z lock, share .)x \*($n transactions from updating or removing root entries until the transaction is committed or aborted. If no root entry exists for \*(lqname\*(rq, esmFAILURE is returned and sm_errno is set to esmBADROOTNAME. .sp .(b L \fBsm_RemoveRootEntry (volid, name) VOLID volid; /* IN volume identifier */ char *name; /* IN name of entry */ .)b .(x z sm_RemoveRootEntry(\ ) .)x \*($n .lp Sm_RemoveRootEntry(\ ) removes the root entry stored under \*(lqname\*(rq. Sm_RemoveRootEntry(\ ) obtains an exclusive .(x z lock, exclusive .)x \*($n lock on the root area of the volume, so removal of root entries should be performed in a short transaction. .br .sh 2 "Buffer Operations" .lp The Storage Manager buffer manager implements the concept of a \fIbuffer group\fR, as proposed in the DBMIN buffer management .(x z buffer group .)x \*($n algorithm [Chou85]. The essence of the DBMIN algorithm is that competing uses of the buffer pool may be allocated their own buffers, to minimize competition for the buffers and to eliminate thrashing in the buffer pool. .lp All uses of the buffer pool are made through a buffer group. 
A buffer group is a container of page buffers, with a limit on the number of \fIfixed\fR pages it can contain. .(x z pages, fixed .)x \*($n .(x z fixed pages .)x \*($n Fixed pages are guaranteed to remain in the buffer pool until they are \fIunfixed\fR. .(x z unfixed pages .)x \*($n .(x z pages, unfixed .)x \*($n Their locations (virtual addresses) may change, but the pages remain in the virtual address space of the buffer pool. Each buffer group has a replacement policy, which controls the replacement of unfixed pages within the buffer group. .lp Buffer groups can be opened and closed at any time, whether or not a transaction is running. If a buffer group is opened in a transaction, it may be \*(lqattached\*(rq to the transaction, which means that the buffer group is closed by the client library when the transaction ends. An attached buffer group can be closed explicitly by the application before the transaction ends. .lp The following two macros can be used with buffer groups to give an initial value to a buffer group index and to recognize that value. .(b I \fBINVALIDATE_BUFGROUP (int bufgroup)\fR .)b .lp sets the \*(lqbufgroup\*(rq argument to an invalid buffer group index. .(b I \fBBUFGROUP_IS_INVALID (int bufgroup)\fR .)b .lp returns TRUE if \*(lqbufgroup\*(rq is the value given by INVALIDATE_BUFGROUP(\ ), FALSE if it is not. BUFGROUP_IS_INVALID(\ ) does not tell if there exists a buffer group with the given index. .sp .(b L \fBsm_OpenBufferGroup (groupSize, policy, groupIndex, flags) int groupSize; /* IN the maximum group size in pages */ int policy; /* IN the group's replacement policy */ int *groupIndex; /* OUT the group's index */ FLAGS flags; /* IN buffer group attributes */\fR .)b .(x z sm_OpenBufferGroup(\ ) .)x \*($n .lp Sm_OpenBufferGroup(\ ) opens a new buffer group. The \*(lqgroupSize\*(rq argument specifies the size of the buffer group in MIN_PAGESIZE pages. The sum of the sizes of all open buffer groups cannot exceed the size of the buffer pool. 
(See Section 4.11.3, \fBTuning the Application\fR.) The choice for \*(lqpolicy\*(rq is least-recently-used (BF_LRU) or most-recently-used (BF_MRU). BF_LRU and BF_MRU are defined in \fCsm_client.h\fR. The argument \*(lqgroupIndex\*(rq is filled by the Storage Manager and must be used in subsequent references to the buffer group. (All operations on files and objects require a buffer group index.) .lp The \*(lqflags\*(rq argument indicates whether the buffer group is to be associated with a transaction. NOFLAGS indicates that it is not. TRANS_GROUP indicates that the buffer group is associated with the current transaction. The group is closed by the client library when the active transaction ends. If TRANS_GROUP is used, a transaction must be running at the time sm_OpenBufferGroup(\ ) is called. .lp The effect of sm_OpenBufferGroup(\ ) is to reserve \*(lqgroupSize\*(rq pages in the client's buffer pool. No buffer group is opened on the server. .sp .(b L \fBsm_BufferGroupInfo (groupIndex, maxPages, fixedPages, unfixedPages) int groupIndex; /* IN the group to inspect */ int *maxPages; /* OUT max fixed pages allowed */ int *fixedPages; /* OUT current # of pages fixed */ int *unfixedPages; /* OUT current # of pages unfixed */\fR .)b .(x z sm_BufferGroupInfo(\ ) .)x \*($n .lp Sm_BufferGroupInfo(\ ) returns information about the open buffer group identified by \*(lqgroupIndex\*(rq. The function returns the buffer group's size limit in pages in \*(lqmaxPages\*(rq. In \*(lqfixedPages\*(rq, it returns the number of pages currently fixed in the buffer group. See the next section for more information about these functions. The argument \*(lqunfixedPages\*(rq refers to all buffer pages that belong to the buffer group but are not fixed; that is, these pages may be removed from the buffer pool if space is needed for fixed pages. 
.sp .(b L \fBsm_CloseBufferGroup (groupIndex) int groupIndex; /* IN the group being closed */\fR .)b .(x z sm_CloseBufferGroup(\ ) .)x \*($n Sm_CloseBufferGroup(\ ) closes the open buffer group identified by \*(lqgroupIndex\*(rq. \" ********************************************************************** .br .sh 2 "Operations on Objects" .lp An object in the Storage Manager is a container of bytes. It can be empty. It can have as many as 2\*[31\*] bytes, if the volume on which it resides is large enough. An object must fit on a single volume (storage device or partition). When an object is created, the Storage Manager gives the object a unique object identifier. An object identifier is described by a structure of the type OID, defined as follows: .(b I \fBtypedef struct { SHORTPID pid; /* 32-bit page address of the object's header */ SLOTINDEX slot; /* 16-bit slot number of the object on the page */ VOLID volid; /* 16-bit identifier of the volume */ UNIQUE unique; /* 32-bit number generated at creation time */ } OID; \fR .(x z OID .)x \*($n .)b .lp The first three fields of an OID are the physical address of the object; they identify a volume, a page within the volume, and a \fIslot\fR on the page. .(x z slot .)x \*($n An object's identifier never changes. The client library sometimes moves objects, such as when an object grows beyond the size of a page, at which time the object is marked as \fIforwarded\fR, but its OID remains .(x z forwarded object .)x \*($n unchanged. .lp The \*(lqunique\*(rq field of an OID is a special 32-bit value that is generated when the object is created and used to detect dangling and corrupted OIDs. The generation of unique numbers is discussed in Appendix B. .lp Every time an object is accessed by its OID, the Storage Manager validates the OID. 
The application can use the following macros to give an illegitimate initial value to an OID, and to recognize that value: .(b I \fBINVALIDATE_OID (OID oid)\fR .)b sets the \*(lqoid\*(rq argument to an invalid object identifier. .(b I \fBOID_IS_INVALID (OID oid)\fR .)b returns TRUE if \*(lqoid\*(rq is the value given by INVALIDATE_OID(\ ), FALSE if it is not. .lp Each object has an \fIobject header\fR, which describes the object, .(x z object header .)x \*($n and which can be retrieved without retrieving the object's data. The structure of an object header is shown below: .(b I \fBtypedef struct { TWO properties; /* a bit vector */ TWO tag; /* supplied by the application */ int size; /* size of the object in bytes */ } OBJHDR;\fR .(x z OBJHDR .)x \*($n .(x z object header .)x \*($n .)b .lp The \*(lqtag\*(rq is a two-byte field that the Storage Manager does not interpret. It is for use by the application. No restriction is put on the contents of \*(lqtag\*(rq fields. As its name implies, the \*(lqsize\*(rq field is the size of the object in bytes. The \*(lqproperties\*(rq field is a read-only bit-vector that indicates the presence or absence of the following properties of objects: .(b I .ip " P_LARGEOBJ" 23 set if the object is a large object. .ip " P_MOVED" 23 set if this object has been forwarded to another page. .ip " P_FROZEN" 23 set if the object is a frozen version. .ip " P_VERSIONED" 23 set if the object is a frozen version or a descendent of a frozen version. .)b .lp Each object resides in a \fIfile\fR on a \fIvolume\fR. .(x z file .)x \*($n .(x z volume .)x \*($n When an object is created, the application tells the client library in which file to place the object. Files and their uses are discussed in the next section; details of their use are not pertinent to understanding the operations on objects. 
.lp Before an operation can be performed on an existing object, the object, or at least the affected parts of the object, must be brought into the application's address space. This is called \fIpinning\fR the object or its parts. .(x z pin .)x \*($n When the object is no longer needed, it must be \fIunpinned\fR, .(x z unpin .)x \*($n to make room for other objects to be pinned\**. .(f \** Objects are pinned; pages are fixed. The gist of the two verbs is the same. .)f When the client library pins an object in order to perform an operation on behalf of the application (for example, appending bytes to an object), the client library pins the necessary parts of the object and unpins them before it returns control to the application. When the application pins part of an object for its own purposes (such as writing over bytes in the object), the pinned part is placed in the client's buffer pool, and the client library creates a \*(lqhandle\*(rq for the object. The handle is called a \fIuser descriptor\fR. The application can refer to an object only through user descriptors. The application must unpin the object by \fIreleasing\fR the user descriptor when it is done using the object. .lp A user descriptor is called \fIvalid\fR if and only if the byte range it addresses is pinned. An application can pin an object or overlapping parts of an object any number of times, having any number of valid user descriptors for the same data in an object. (This is not wise for performance reasons, but it can be done.) .lp The client library functions that pin ranges of bytes return user descriptors to describe the bytes pinned. Functions that require that the range of bytes they affect be pinned take user descriptors as input arguments. The client library functions that do not take user descriptor arguments do not ultimately change the quantity of bytes pinned or the number of pages fixed in the buffer pool. 
\fBSuch functions may change the ranges of bytes addressed or the bytes themselves, but they do not change the quantity of bytes addressed.\fR (For example, the function sm_InsertInObject(\ ) may affect valid user descriptors even though it does not take any user descriptors as arguments.) .(x z user descriptor .)x \*($n .lp User descriptors have the following form: .(b I \fBtypedef struct { char *basePtr; /* ptr to start of data */ int byteCount; /* number of bytes accessible */ int objectSize; /* total size of object */ TWO userFlags; /* properties field from object header */ TWO type; /* for use only by E */ TWO flags; /* for use only by E */ TWO tag; /* tag field from the object header */ OID oid; /* oid of object being referenced */ } USERDESC;\fR .)b .(x z USERDESC .)x \*($n .(x z user descriptor .)x \*($n .lp The \*(lqbasePtr\*(rq field of a user descriptor points to the start of the object's data in the buffer pool, while the \*(lqbyteCount\*(rq field indicates the number of bytes accessible to the application program through this user descriptor. The value \*(lqobjectSize\*(rq is the length of the entire object. The \*(lquserFlags\*(rq field holds a copy of the properties field from the object's header. The \*(lqtype\*(rq and \*(lqflags\*(rq fields are used by the E language's persistent virtual machine. Finally, the \*(lqtag\*(rq field contains a copy of the \*(lqtag\*(rq field in the object's header. .lp An object's data is referenced indirectly via the \*(lqbasePtr\*(rq field. \fBReferences by the application must always be indirect via \*(lqbasePtr\*(rq\fR. The indirection is necessary because there are times when the Storage Manager moves an object in the buffer pool, and the \*(lqbasePtr\*(rq of each user descriptor that references the object is updated to account for the move. .lp The remainder of this section describes the Storage Manager functions for operating on objects. 
It is divided into sub-sections that describe creating and destroying objects, pinning and unpinning parts of objects, modifying objects, and using object headers. .br .sh 3 "Creating and Destroying Objects" .sp .(b L \fBsm_CreateObject (groupIndex, fid, nearHint, nearObj, objHdr, length, data, oid) int groupIndex; /* IN buffer group to use */ FID *fid; /* IN file in which object is to be placed */ int nearHint; /* IN flag indicating where to create the new object */ OID *nearObj; /* IN create the new object near this object */ OBJHDR *objHdr; /* IN the object's header */ int length; /* IN amount of data */ void *data; /* IN the initial data for the object */ OID *oid; /* OUT the new object's OID */\fR .)b .(x z sm_CreateObject(\ ) .)x \*($n .lp Sm_CreateObject(\ ) creates an object in the file identified by \*(lqfid\*(rq. If \*(lqobjHdr\*(rq is not NULL, the \*(lqtag\*(rq field in the header of the new object is initialized with the contents of the \*(lqtag\*(rq field in the header structure addressed by \*(lqobjHdr\*(rq. When \*(lqdata\*(rq is not NULL, the object is initialized with the data addressed by the argument \*(lqdata\*(rq; in this case, \*(lqlength\*(rq specifies how much data to copy. When \*(lqdata\*(rq is NULL, an object of size \*(lqlength\*(rq is created and filled with zeroes. .sp The argument \*(lqnearHint\*(rq specifies where the new object should be created. The following values, defined in \fCsm_client.h\fR, are near hints: NEAR_OBJ, NEAR_FIRST, and NEAR_LAST. If \*(lqnearHint\*(rq is set to NEAR_OBJ, the new object is created near the object designated by \*(lqnearObj\*(rq. If \*(lqnearHint\*(rq is set to NEAR_FIRST or NEAR_LAST, \*(lqnearObj\*(rq is ignored and the new object is created near the first or last object in the file, respectively. .lp If sm_CreateObject(\ ) is successful, the OID structure pointed to by \*(lqoid\*(rq is filled with the OID of the new object. Sm_CreateObject(\ ) does not leave the new object pinned. 
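.lp The initialization rule for new objects (copy \*(lqlength\*(rq bytes from \*(lqdata\*(rq, or zero-fill when \*(lqdata\*(rq is NULL) can be pictured with a small in-memory stand-in. This is not the Storage Manager implementation; demo_create_object, demo_destroy_object, and DEMO_OBJECT are hypothetical names that only mimic the documented behavior of the \*(lqlength\*(rq and \*(lqdata\*(rq arguments.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a stored object. */
typedef struct {
    int   size;
    char *bytes;
} DEMO_OBJECT;

/* Mimics the documented length/data rule of sm_CreateObject():
 * copy `length` bytes from `data`, or zero-fill when `data` is NULL.
 * Returns 0 on success, -1 on failure. */
static int demo_create_object(int length, const void *data, DEMO_OBJECT *obj)
{
    if (length < 0 || obj == NULL)
        return -1;
    /* calloc zero-fills; the extra byte avoids a zero-byte request */
    obj->bytes = calloc((size_t)length + 1, 1);
    if (obj->bytes == NULL)
        return -1;
    if (data != NULL)
        memcpy(obj->bytes, data, (size_t)length);
    obj->size = length;
    return 0;
}

/* Frees the stand-in object's storage. */
static void demo_destroy_object(DEMO_OBJECT *obj)
{
    free(obj->bytes);
    obj->bytes = NULL;
    obj->size = 0;
}
```

The same NULL-means-zeroes convention is documented for the \*(lqdata\*(rq argument of the functions that modify objects.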
.(b L \fBsm_DestroyObject (groupIndex, oid) int groupIndex; /* IN buffer group in use */ OID *oid; /* IN the object to destroy */\fR .)b .(x z sm_DestroyObject(\ ) .)x \*($n .lp Sm_DestroyObject(\ ) destroys an object. If any user descriptors are valid for the object when the object is destroyed, they are made invalid, and they must be released with sm_ReleaseObject(\ ), described below. .br .sh 3 "Pinning and Unpinning Objects" .sp .lp The following two functions change the number of pages fixed in the client buffer pool. All the other functions that operate on objects fix pages temporarily and unfix the pages before returning. .(b L \fBsm_ReadObject (groupIndex, oid, start, length, userDesc) int groupIndex; /* IN buffer group in use */ OID *oid; /* IN object to read */ int start; /* IN starting offset of read */ int length; /* IN amount of data to read */ USERDESC **userDesc; /* OUT descriptor to access the data */\fR .)b .(x z sm_ReadObject(\ ) .)x \*($n .lp Sm_ReadObject(\ ) reads part or all of the object identified by \*(lqoid\*(rq into the buffer group identified by \*(lqgroupIndex\*(rq. If \*(lqlength\*(rq is READ_ALL, the entire object is read (assuming that the size of the entire object is not greater than the amount of unpinned space in the buffer group). Otherwise, the bytes to be read are specified by \*(lqstart\*(rq and \*(lqlength\*(rq. .lp Sm_ReadObject(\ ) pins the specified range of bytes in the buffer pool and returns a user descriptor to the caller. .(x z pin, object .)x \*($n \fBBytes pinned in the buffer pool by sm_ReadObject(\ ) remain pinned until they are explicitly released by sm_ReleaseObject(\ ).\fR .lp While sm_ReadObject(\ ) can be used to get information about the object (from the object header) by giving it a length of zero, sm_ReadObjectHeader(\ ) is the preferred way to meet the same objective. 
Sm_ReadObject(\ ) performs work that is unnecessary when only the object header is of interest, and it always fixes at least one page in the buffer pool, even if the given length is zero. .lp The user descriptor consumes resources that must be freed with sm_ReleaseObject(\ ), even if the object is not pinned .(x z sm_ReleaseObject(\ ) .)x \*($n (zero is given for \*(lqlength\*(rq). .sp .(b L \fBsm_ReleaseObject (userDesc) USERDESC *userDesc; /* IN descriptor returned by ReadObject */\fR .)b .(x z sm_ReleaseObject(\ ) .)x \*($n .lp Sm_ReleaseObject(\ ) unpins a range of bytes of an object that was pinned by sm_ReadObject(\ ), and frees the resources associated with the user descriptor. If the user descriptor is not valid, sm_ReleaseObject(\ ) sets sm_errno to esmBADUSERDESC and returns esmFAILURE. .br .sh 3 "Modifying Objects" .lp Four functions modify objects: .(x z object, modifying .)x \*($n sm_WriteObject(\ ), sm_InsertInObject(\ ), sm_AppendToObject(\ ), and sm_DeleteFromObject(\ ). Sm_WriteObject(\ ) cannot be used to change the size of an object, .(x z sm_WriteObject(\ ) .)x \*($n only to overwrite parts of an object. The other three functions can change the size of an object. These functions provide substantial flexibility, and their efficiency varies. Changing the size of a small object (one that fits on a disk page) is relatively inexpensive. It is less expensive than reading and writing the object. For large objects, performing many small-size changes can be expensive in CPU time and buffer space utilization. If a large object is pinned several times simultaneously, through different user descriptors, updates to the object are very expensive. If a large number of small-size changes is required, we recommend accumulating the changes and performing them in larger chunks. 
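.lp One way to follow the recommendation to accumulate small changes is to collect them in a memory buffer and hand them to the Storage Manager in larger pieces. The sketch below shows this for appends only. APPEND_BATCH, batch_init, batch_add, batch_flush, the 4096-byte capacity, and the counting callback are all assumptions made for the example; in an application the callback would wrap a call to sm_AppendToObject(\ ).

```c
#include <assert.h>
#include <string.h>

#define BATCH_CAPACITY 4096   /* arbitrary size chosen for the example */

/* Callback that performs one real append. Returns 0 on success. */
typedef int (*BATCH_APPEND_FN)(void *context, int length, const void *data);

typedef struct {
    char            buffer[BATCH_CAPACITY];
    int             used;      /* bytes currently queued */
    BATCH_APPEND_FN append;
    void           *context;
} APPEND_BATCH;

static void batch_init(APPEND_BATCH *b, BATCH_APPEND_FN append, void *context)
{
    b->used = 0;
    b->append = append;
    b->context = context;
}

/* Push the queued bytes out as a single append. */
static int batch_flush(APPEND_BATCH *b)
{
    int rc = 0;
    if (b->used > 0) {
        rc = b->append(b->context, b->used, b->buffer);
        if (rc == 0)
            b->used = 0;
    }
    return rc;
}

/* Queue a small change; flush first if it would not fit. */
static int batch_add(APPEND_BATCH *b, const void *data, int length)
{
    int rc;
    if (length < 0)
        return -1;
    if (length > BATCH_CAPACITY) {
        /* too big to buffer: flush what is queued, then append directly */
        rc = batch_flush(b);
        return rc != 0 ? rc : b->append(b->context, length, data);
    }
    if (b->used + length > BATCH_CAPACITY) {
        rc = batch_flush(b);
        if (rc != 0)
            return rc;
    }
    memcpy(b->buffer + b->used, data, (size_t)length);
    b->used += length;
    return 0;
}

/* Demonstration callback: counts calls and bytes instead of appending. */
typedef struct { int calls; int bytes; } BATCH_STATS;

static int batch_count_append(void *context, int length, const void *data)
{
    BATCH_STATS *s = context;
    (void)data;
    s->calls++;
    s->bytes += length;
    return 0;
}
```

With the counting callback, one thousand 10-byte additions followed by a final batch_flush reach the callback in three calls rather than one thousand.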
.sp .(b L \fBsm_WriteObject (groupIndex, start, length, data, userDesc, release) int groupIndex; /* IN buffer group in use */ int start; /* IN starting offset of write */ int length; /* IN amount of data to be written */ void *data; /* IN pointer to the data */ USERDESC *userDesc; /* IN descriptor returned by ReadObject */ BOOL release; /* IN whether to release the object */\fR .)b .(x z sm_WriteObject(\ ) .)x \*($n .lp Sm_WriteObject(\ ) overwrites the region of bytes from (userDesc->basePtr\ +\ start) to (userDesc->basePtr\ +\ start\ +\ length\ -\ 1) with the data addressed by the \*(lqdata\*(rq argument. The given byte range must have been pinned (which means that the user descriptor must be valid). If \*(lqrelease\*(rq is TRUE, the range of bytes given by \*(lquserDesc\*(rq is unpinned when sm_WriteObject(\ ) returns. If \*(lqdata\*(rq is NULL, the region is filled with zeroes. \fBAll updates to objects must be performed using sm_WriteObject(\ )\fR so that the updates can be logged, and the transaction semantics can be guaranteed. .sp .(b L \fBsm_InsertInObject (groupIndex, oid, start, length, data) int groupIndex; /* IN buffer group in use */ OID *oid; /* IN object we're inserting into */ int start; /* IN starting offset of insert */ int length; /* IN amount of data being inserted */ void *data; /* IN data to insert */\fR .)b .(x z sm_InsertInObject(\ ) .)x \*($n .lp Sm_InsertInObject(\ ) inserts \*(lqlength\*(rq bytes of data into an object, beginning at the offset \*(lqstart\*(rq. If \*(lqdata\*(rq is NULL, the inserted region is filled with zeroes. If there are any valid user descriptors (those for which sm_ReleaseObject(\ ) has not been called) for the object at the time the insertion takes place, they are reestablished if necessary. After the insertion, the base pointers of the valid user descriptors point to the byte within the object indicated by the \*(lqstart\*(rq argument to the sm_ReadObject(\ ) operation that created the user descriptor. 
For example, an object's first five bytes, "ABCDE" are pinned by sm_ReadObject(\ ), which was called with a \*(lqstart\*(rq offset of zero and a \*(lqlength\*(rq of five. Sm_ReadObject(\ ) returns a user descriptor, U, which addresses "ABCDE". Sm_InsertInObject(\ ) inserts "ZZ" at \*(lqstart\*(rq offset zero. The user descriptor U now addresses "ZZABC", which are pinned, while the bytes "DE" are no longer pinned. .sp .(b L \fBsm_AppendToObject (groupIndex, oid, length, data) int groupIndex; /* IN buffer group in use */ OID *oid; /* IN object we are appending data to */ int length; /* IN amount of data being appended */ void *data; /* IN data to append */\fR .)b .(x z sm_AppendToObject(\ ) .)x \*($n .lp Sm_AppendToObject(\ ) appends \*(lqlength\*(rq bytes of data to the end of an object. Outstanding user descriptors are handled the same way as in sm_InsertInObject(\ ). If \*(lqdata\*(rq is NULL, the appended region is filled with zeroes. .sp .(b L \fBsm_DeleteFromObject (groupIndex, oid, start, length) int groupIndex; /* IN buffer group in use */ OID *oid; /* IN object we're deleting from */ int start; /* IN starting offset of delete */ int length; /* IN amount of data being deleted */\fR .)b .(x z sm_DeleteFromObject(\ ) .)x \*($n .lp Sm_DeleteFromObject(\ ) deletes \*(lqlength\*(rq bytes of data from an object, beginning with the byte indicated by the offset \*(lqstart\*(rq. .(x z user descriptor .)x \*($n Sm_DeleteFromObject(\ ) is analogous to sm_InsertInObject(\ ). All valid user descriptors affected by the deletion are, if possible, reset to point to the new absolute offset within the object. There are two cases when this is not possible. .np The object's size is now smaller than the starting offset of a user descriptor. The \*(lqbasePtr\*(rq field in the user descriptor is set to NULL and the user descriptor is made invalid. The user descriptor must be released by sm_ReleaseObject(\ ) so that its resources can be reclaimed. 
.np The object's size is now smaller than the original byte range addressable by a user descriptor. The size of the range addressable by the descriptor is reduced to reflect the new size of the object. .br .sh 3 "Object Headers" .(x z object header .)x \*($n .sp .(b L \fBsm_ReadObjectHeader (groupIndex, oid, objHdr) int groupIndex; /* IN buffer group in use */ OID *oid; /* IN read this object's header */ OBJHDR *objHdr; /* OUT place to put the header */\fR .)b .(x z sm_ReadObjectHeader .)x \*($n .lp Sm_ReadObjectHeader(\ ) reads an object's header into the structure addressed by \*(lqobjHdr\*(rq. This function is the preferred one to use to determine if an object's identifier is valid. If the object's identifier is invalid, sm_ReadObjectHeader(\ ) returns esmFAILURE and puts esmBADOID in sm_errno. .sp .(b L \fBsm_SetObjectHeader (groupIndex, oid, objHdr) int groupIndex; /* IN buffer group in use */ OID *oid; /* IN set this object's header flags */ OBJHDR *objHdr; /* IN the new header */\fR .)b .(x z sm_SetObjectHeader .)x \*($n .lp Sm_SetObjectHeader(\ ) modifies an object's header. Only the \*(lqtag\*(rq field is modified; the other fields are read-only. .br .sh 2 "Versions of Objects" .(x z version of object .)x \*($n .lp In order to allow efficient updating of shared data, the Storage Manager offers versions of objects. Versions come in two kinds: \fIworking versions\fR and \fIfrozen versions\fR. .(x z version, working .)x \*($n .(x z working version .)x \*($n .(x z version, frozen .)x \*($n .(x z frozen version .)x \*($n A working version of an object is one that can be modified. Every object has at least one version, which is the object itself. A working version may be frozen, after which it can no longer be modified. .lp A new working version, called a \fIdescendent\fR, can be made of a frozen object. The descendent looks like a new object that is a copy of the frozen object from which it came. 
The Storage Manager determines when it is necessary and efficient to make a copy of the frozen object, and makes the copy at that time. .sp .(b L \fBsm_CreateVersion (groupIndex, nearHint, parentObj, nearObj, oid) int groupIndex; /* IN buffer group to use */ int nearHint; /* IN flag indicating where to create the new version */ OID *parentObj; /* IN object to create a version of */ OID *nearObj; /* IN create the new version near this object */ OID *oid; /* OUT the new version's OID */\fR .)b .(x z sm_CreateVersion .)x \*($n .lp Sm_CreateVersion(\ ) creates a new version of the object \*(lqparentObj\*(rq in the file containing \*(lqparentObj\*(rq. The arguments \*(lqgroupIndex\*(rq, \*(lqnearHint\*(rq, and \*(lqnearObj\*(rq are used as in sm_CreateObject(\ ). The object identifier of the new version is returned in \*(lqoid\*(rq. \fBThe object identified by \*(lqparentObj\*(rq must be a frozen version\fR. The new version is a working version. The new version can be destroyed using sm_DestroyObject(\ ). When a new version is created, the P_VERSIONED property is set in the object header. .lp Like sm_CreateObject(\ ), sm_CreateVersion(\ ) does not leave anything pinned in the buffer pool. .sp .(b L \fBsm_FreezeVersion (groupIndex, oid) int groupIndex; /* IN buffer group to use */ OID *oid; /* IN object to be frozen */\fR .)b .(x z sm_FreezeVersion .)x \*($n .lp Sm_FreezeVersion(\ ) marks an object as frozen, preventing subsequent modification of the object, and allowing new working versions to be made from this object. When an object is frozen, both the P_VERSIONED and the P_FROZEN properties are set in the object header. Once frozen, an object cannot be unfrozen. A frozen object can be destroyed. .br .sh 2 "Operations on Files" .(x z file, what is a .)x \*($n .(x z file, operations on .)x \*($n .lp A Storage Manager file is a flexible container in which objects are placed when they are created. No object exists outside a file. 
.lp The objects in a file can be \fIscanned\fR, meaning that .(x z scanning a file .)x \*($n they are visited exactly once. .lp Files do not have preallocated space or ownership properties. Various consistency guarantees can be associated with files, with the effect that updating data in different files has different costs. .lp The Storage Manager offers operations for creating, destroying, scanning, bulk-loading files, and for changing the consistency guarantees associated with files. Some operations on files acquire locks on entire files. The locks acquired are described in Appendix A. .lp A file is identified by a unique file identifier or FID. The Storage Manager does not provide a way to find all files or file identifiers that exist, so it is left to the application to keep track of its file identifiers. .(x z FID .)x \*($n .(x z file identifier .)x \*($n For example, consider an application that embeds file identifiers in objects to create a logical hierarchy of files. The application had best destroy the files in a depth-first fashion, lest it lose a file identifier before the file it identifies is destroyed. .lp The following two macros can be used to give a file identifier an illegitimate initial value, and later to recognize that value: .(b I \fBINVALIDATE_FID (FID fid)\fR .)b .(x z INVALIDATE_FID .)x \*($n sets \*(lqfid\*(rq to an invalid file identifier. .(b I \fBFID_IS_INVALID (FID fid)\fR .)b .(x z FID_IS_INVALID .)x \*($n returns TRUE if \*(lqfid\*(rq is the invalid identifier given by INVALIDATE_FID(\ ), FALSE otherwise. .lp The rest of this section describes operations on files and operations that concern entire files of objects. .sh 3 "Consistency Guarantees for Files" .(x z files, consistency guarantees for .)x \*($n .lp The \fIlog level\fR of a file determines what .(x z files, log level .)x \*($n level of consistency is maintained for the file in the event that a transaction aborts or a server crashes. 
There are two log levels for files on data volumes: LOG_ALL and LOG_SPACE. LOG_ALL indicates that consistency is maintained for user data and meta-data. LOG_SPACE indicates that meta-data are guaranteed to be consistent. This means that all objects are available and that they are the correct size, but their contents may be corrupted. Files that have their log level set to LOG_SPACE are flushed when the transaction is committed. \fBData pages for large objects (objects that do not fit on a single disk page) may not be flushed, so there is no guarantee that the data is safely on disk until the server dismounts the volume.\fR The log level is not a permanent attribute of a file. When an application sets the log level for a file, the setting lasts until it is changed or until sm_ShutDown(\ ) is called. If, in a transaction, the log level for a file is changed from LOG_SPACE to LOG_ALL, the Storage Manager guarantees only that the meta-data are consistent. .lp LOG_ALL is the default log level for data files. .(x z log level, default .)x \*($n .(x z default log level .)x \*($n LOG_SPACE is designed to conserve log space and increase performance for those files whose data integrity is not critical. For example, results of a query may be stored in a file with its log level set to LOG_SPACE, since the file can be regenerated in the event of a failure. To conserve log space when loading a large file, the log level for a file may be set to LOG_SPACE. Once the loading transaction is committed, the log level should be set to LOG_ALL. .lp Files on temporary volumes can have only one log level: LOG_NONE. .(x z temporary volume .)x \*($n See Section 5.1.3, \fBTemporary Volumes\fR, for more information about temporary volumes. 
.lp Sm_SetLogLevel(\ ) is used to change the log level for a list of files: .sp .(b L \fBsm_SetLogLevel (logLevel, fileCount, fids) int logLevel; /* IN log level */ int fileCount; /* IN number of files to set level for */ FID fid[]; /* IN list of files to set level for */ \fR .)b .(x z sm_SetLogLevel(\ ) .)x \*($n .lp The \*(lqlogLevel\*(rq argument takes the values LOG_SPACE and LOG_ALL. The \*(lqfileCount\*(rq argument indicates the size of the last argument, \*(lqfid[]\*(rq, which is a list of file identifiers of the files whose log levels are to be affected by this function. It is not an error for a file in the list already to have the given log level. .lp If \*(lqfileCount\*(rq is zero, \fBall\fR files are given \*(lqlogLevel\*(rq. .lp The volumes on which the files reside must be available for mounting, and a side effect of setting the log level is that the volumes are mounted. .lp Sm_SetLogLevel(\ ) has no effect on files that reside on temporary volumes .(x z temporary volume .)x \*($n (see Section 5.1.3, \fBTemporary Volumes\fR). .(b L \fBsm_CreateFile (groupIndex, volid, fid) int groupIndex; /* IN buffer group in use */ VOLID volid; /* IN the volume in which to place the file */ FID *fid; /* OUT the file ID of the new file */ .)b .(x z sm_CreateFile(\ ) .)x \*($n .lp Sm_CreateFile(\ ) creates a new file on the volume indicated by \*(lqvolid\*(rq. The file identifier of the new file is returned in the structure to which \*(lqfid\*(rq points. The caller is responsible for allocating space for the FID. .sp .(b L \fBsm_DestroyFile (groupIndex, fid) int groupIndex; /* IN buffer group in use */ FID *fid; /* IN the file to destroy */\fR .)b .(x z sm_DestroyFile(\ ) .)x \*($n .lp Sm_DestroyFile(\ ) destroys the file identified by \*(lqfid\*(rq. The objects in the file are destroyed along with the file. Disk space is released when the transaction is committed. 
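.lp Because the Storage Manager offers no way to enumerate files, an application that embeds file identifiers in objects to form a hierarchy of files has to destroy the child files before the file that records their identifiers. The sketch below shows the ordering only. DEMO_FILE, demo_destroy_tree, and the integer ids are hypothetical stand-ins; in a real application the visit step would call sm_DestroyFile(\ ) on the corresponding FID.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_CHILDREN 4

/* Hypothetical stand-in for a file whose objects embed the
 * identifiers of other files. */
typedef struct DEMO_FILE {
    int               id;
    int               childCount;
    struct DEMO_FILE *children[DEMO_MAX_CHILDREN];
} DEMO_FILE;

/* Visit children first, then the file itself (post-order), so that no
 * identifier is lost before the file it names has been destroyed.
 * `order` receives the ids in destruction order; returns the new count. */
static int demo_destroy_tree(const DEMO_FILE *file, int *order, int count)
{
    int i;
    if (file == NULL)
        return count;
    for (i = 0; i < file->childCount; i++)
        count = demo_destroy_tree(file->children[i], order, count);
    order[count++] = file->id;   /* real code: sm_DestroyFile(groupIndex, &fid) */
    return count;
}
```

The caller supplies an \*(lqorder\*(rq array large enough to hold one entry per file in the hierarchy.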
.sp .(b L \fBsm_GetFirstOid (groupIndex, fid, oid, objHdr, emptyFlag) int groupIndex; /* IN buffer group in use */ FID *fid; /* IN the file */ OID *oid; /* OUT first OID */ OBJHDR *objHdr; /* OUT the object's header */ BOOL *emptyFlag; /* OUT empty file flag */\fR .)b .(x z sm_GetFirstOid(\ ) .)x \*($n .lp Sm_GetFirstOid(\ ) retrieves the object identifier and the object header of the first object in the file designated by \*(lqfid\*(rq. The first object is the first object on the first physical page in the file. If the file does not contain any objects, \*(lqemptyFlag\*(rq is set to TRUE. .sp .(b L \fBsm_GetLastOid (groupIndex, fid, oid, objHdr, emptyFlag) int groupIndex; /* IN buffer group in use */ FID *fid; /* IN the file */ OID *oid; /* OUT last OID */ OBJHDR *objHdr; /* OUT the object's header */ BOOL *emptyFlag; /* OUT empty file flag */\fR .)b .(x z sm_GetLastOid(\ ) .)x \*($n .lp Sm_GetLastOid(\ ) retrieves the object identifier and the object header of the last object in the file designated by \*(lqfid\*(rq. The last object is the last object on the last physical page in the file. If the file does not contain any objects, \*(lqemptyFlag\*(rq is set to TRUE. .sp .(b L \fBsm_GetNextOid (groupIndex, baseOid, nextOid, objHdr, endMarker) int groupIndex; /* IN buffer group in use */ OID *baseOid; /* IN next relative to this object */ OID *nextOid; /* OUT OID of the next object */ OBJHDR *objHdr; /* OUT the object's header */ BOOL *endMarker; /* OUT end-of-file flag */\fR .)b .(x z sm_GetNextOid(\ ) .)x \*($n .lp Sm_GetNextOid(\ ) retrieves the object identifier and the object header of the next object in the file relative to the object addressed by \*(lqbaseOid\*(rq. \*(lqEndMarker\*(rq is set to TRUE when end-of-file is reached (i.e., when there is no next object for sm_GetNextOid(\ ) to return). .lp The next object is that which resides \fIphysically\fR next in the file. 
There is no way to scan a file's objects in the order in which they were inserted in the file. .lp The preferred method for retrieving all the objects in a file is to use scans, described in the next sub-section. Scans are more efficient than using sm_GetNextOid(\ ), which is present for backward compatibility. .sp .(b L \fBsm_GetPreviousOid (groupIndex, baseOid, prevOid, objHdr, endMarker) int groupIndex; /* IN buffer group in use */ OID *baseOid; /* IN previous relative to this object */ OID *prevOid; /* OUT OID of the previous object */ OBJHDR *objHdr; /* OUT the object's header */ BOOL *endMarker; /* OUT start-of-file flag */ .)b .(x z sm_GetPreviousOid(\ ) .)x \*($n .lp Sm_GetPreviousOid(\ ) retrieves the object identifier and object header of the previous object in the file relative to the object addressed by \*(lqbaseOid\*(rq. \*(lqEndMarker\*(rq is set to TRUE when start-of-file is reached (i.e., when there is no previous object for sm_GetPreviousOid(\ ) to return). Much like sm_GetNextOid(\ ), the previous object is the object that is \fIphysically\fR previous in the file. .br .sh 3 "Scanning Files" .lp The objects in a file can be visited most efficiently by scanning the file. During a \fIscan\fR, the client library locks the entire file so that while one application is using the file, objects cannot be inserted, deleted, or changed by another application. \fBThe Storage Manager does not support a single application's modifying a file during a scan.\fR The client library also keeps some information about the state of the scan and the structure of the file being scanned. The information is stored in a \fIscan descriptor\fR, a structure of type \fBSCANDESC\fR, which is meant to be treated as \fIopaque\fR by the application. 
.sp .(b L \fBsm_OpenScanWithGroup (fid, type, groupIndex, scanDesc, oid) FID *fid; /* IN file to scan */ int type; /* IN type of scan -- UNUSED */ int groupIndex; /* IN buffer group for use in scan */ SCANDESC **scanDesc; /* OUT returned scan descriptor */ OID *oid; /* IN optional oid to begin scan -- UNUSED */ .)b .(x z sm_OpenScanWithGroup(\ ) .)x \*($n .lp Sm_OpenScanWithGroup(\ ) initializes a scan on the file indicated by \*(lqfid\*(rq. A scan descriptor is passed back in \*(lqscanDesc\*(rq, for use in subsequent scan calls. Using the scan mechanism can be considerably more efficient than using the sm_GetNextOid(\ ) call or sm_ReadObject(\ ). Scans use a buffer group, \*(lqgroupIndex\*(rq. This group should be created with the most-recently-used replacement policy, and its size should be tuned to reflect the buffering requirements for the scan. The buffer group should have a size of at least five pages. .lp Objects are scanned in the order in which they physically reside on disk. After sm_OpenScanWithGroup(\ ) returns, the scan pointer is before the first object in the file. This is true even if the file is empty, in which case the first call to sm_ScanNextObject(\ ) returns a flag indicating the end-of-file condition. The \*(lqtype\*(rq and \*(lqoid\*(rq arguments are not used and are present for backward compatibility. .sp .(b L \fBsm_OpenScan (fid, type, groupSize, scanDesc, oid) FID *fid; /* IN file to scan */ int type; /* IN type of scan -- UNUSED */ int groupSize; /* IN size of buffer group in pages */ SCANDESC **scanDesc; /* OUT returned scan descriptor */ OID *oid; /* IN optional oid to begin scan -- UNUSED */ .)b .(x z sm_OpenScan(\ ) .)x \*($n .lp Sm_OpenScan(\ ) is like sm_OpenScanWithGroup(\ ), but it is less flexible, and it is provided for backward compatibility. It is identical to sm_OpenScanWithGroup(\ ) except that it creates a buffer group with the most-recently-used replacement policy and size \*(lqgroupSize\*(rq. 
\*(lqGroupSize\*(rq should be at least five (pages). The buffer group is destroyed when the scan is closed. .sp .(b L \fBsm_ScanNextObject (scanDesc, start, length, retDesc, eof) SCANDESC *scanDesc; /* IN scan descriptor */ int start; /* IN starting offset in object */ int length; /* IN number of bytes to read */ USERDESC **retDesc; /* OUT descriptor to access the data */ BOOL *eof; /* OUT end of file indicator */ .)b .(x z sm_ScanNextObject(\ ) .)x \*($n .lp Sm_ScanNextObject(\ ) reads the next object in the file and pins the object as if sm_ReadObject(\ ) were used. \*(lqScanDesc\*(rq is the scan descriptor returned when the scan was opened. \*(lqStart\*(rq is the starting offset within the object to return. .lp \*(lqLength\*(rq is the number of bytes of the object to read. If \*(lqlength\*(rq is READ_ALL, the entire object is read (assuming that the size of the entire object is not greater than the amount of unpinned space in the buffer group). To obtain the object header and OID information for the object, use a \*(lqlength\*(rq of zero. .lp Sm_ScanNextObject(\ ) returns a user descriptor for the object, if there is one to pin, whether or not any bytes are pinned. \*(lqEof\*(rq is set to TRUE and \*(lqretDesc\*(rq is set to NULL when there are no more objects to be scanned. Each call to sm_ScanNextObject(\ ) releases the user descriptor returned by the previous scan call, so \fBsm_ReleaseObject(\ ) must not be used\fR .(x z sm_ReleaseObject(\ ) .)x \*($n on user descriptors that are acquired by scanning files. .sp .(b L \fBsm_ScanNextBytes (scanDesc, length) SCANDESC *scanDesc; /* IN scan descriptor */ int length; /* IN number of bytes to read */ .)b .(x z sm_ScanNextBytes(\ ) .)x \*($n .lp Sm_ScanNextBytes(\ ) is useful when a file being scanned contains very large objects that cannot be expected to fit in memory. A sm_ScanNextObject(\ ) call can be made with a relatively small length to read in the first section of an object. 
Sm_ScanNextBytes(\ ) is used subsequently to iterate over the rest of that object, with each call reading in the next \*(lqlength\*(rq bytes of the current scan object. The iteration can be controlled by observing the objectSize field of the user descriptor. esmENDOFOBJECT is returned if there are no more bytes to be read in the current object. .sp .(b L \fBsm_CloseScan (scanDesc) SCANDESC *scanDesc; /* IN scan descriptor */ .)b .(x z sm_CloseScan(\ ) .)x \*($n .lp Sm_CloseScan(\ ) closes the scan associated with \*(lqscanDesc\*(rq. It releases the scan descriptor and the user descriptors and data pinned during the scan. .br .sh 3 "Bulk-loading Files" .lp \fBWARNING\fR: the file bulk load facility does not work properly in version \*V. We recommend that it not be used. .sp .(b L \fBsm_OpenLoad (fid, type, groupSize, fillFactor, loadDesc) FID *fid; /* IN file to scan */ int groupSize; /* IN size of load buffer group */ float fillFactor; /* IN fill percentage */ LOADDESC **loadDesc; /* OUT returned load descriptor */ .)b .(x z sm_OpenLoad(\ ) .)x \*($n .lp Sm_OpenLoad(\ ) prepares to load a set of objects into a file in bulk. Bulk loading a file can be more efficient than using a series of sm_CreateObject(\ ) calls. The file, indicated by \*(lqfid\*(rq, need not be empty, in which case the new objects are added to the end of the file. The load mechanism creates and uses its own buffer group; the size of the buffer group is \*(lqgroupSize\*(rq. .\" The \*(lqfillFactor\*(rq indicates how full to fill pages with objects. .\" A valid value is in the range 0.00 to 1.00, inclusive; 0.00 indicates empty and 1.00 indicates full. The \*(lqfillFactor\*(rq argument is ignored; it is present for future extensions. A \fIload descriptor\fR, \*(lqloadDesc\*(rq is .(x z load descriptor .)x \*($n returned for use in subsequent operations ( sm_LoadNextObject(\ ) and sm_CloseLoad(\ )). 
.sp .(b L \fBsm_LoadNextObject (loadDesc, length, data, oid) LOADDESC *loadDesc; /* IN load descriptor */ int length; /* IN length of the object */ void *data; /* IN the object's data */ OID *oid; /* OUT returned new object id */ .)b .(x z sm_LoadNextObject(\ ) .)x \*($n .lp Sm_LoadNextObject(\ ) creates a new object of size \*(lqlength\*(rq in the file for which the \*(lqloadDesc\*(rq was opened. The new object is initialized with \*(lqdata\*(rq. If \*(lqdata\*(rq is NULL, the object is filled with zeroes. Sm_LoadNextObject(\ ) returns an object identifier for the new object in \*(lqoid\*(rq. .sp .(b L \fBsm_CloseLoad (loadDesc) LOADDESC *loadDesc; /* IN load to close */\fR .)b .(x z sm_CloseLoad(\ ) .)x \*($n .lp Sm_CloseLoad(\ ) ends the bulk-load operation. .br .sh 2 "Operations on Indexes" .(x z index, operations on .)x \*($n .lp The Storage Manager's index facility associates keys with fixed-length elements. The keys can be any basic C data type (SM_int, SM_long, SM_short, SM_float, SM_double) or strings (SM_string). The size of the element is fixed when the index is created. .lp B\*[+\*]tree index and linear hashing index functions are implemented. B\*[+\*]tree provides fast index lookup on all kinds of queries, especially range queries. Linear hashing provides even faster index lookup and supports linear space growth for dynamically growing indexes, but it supports only exact-match queries. More information about linear hashing can be found in [Litw88]. .lp A key is fully described by the \fBKEY\fR structure: .sp .(b L \fBtypedef struct { TWO length; /* length of the key */ void* valuePtr; /* pointer to value of the key */ } KEY; \fR .)b .(x z KEY .)x \*($n .lp Index keys are compared according to the key type given when the index is created. The key type determines the number of bytes considered in a key comparison. In the case of keys that are strings, the length fields in the keys in question determine the number of bytes compared.
Strings are compared one character at a time. The client library does not terminate strings with nulls. When two strings of different lengths are compared, the shorter string is compared with the corresponding substring of the longer string. If the shorter string and the corresponding substring are equal, the longer string is considered to be the larger of the two. This means that "abc\0" is considered larger than "abc". .lp Characters are compared as ASCII values. .sh 3 "Creating and Destroying Indexes " .lp When an index is created, the client library creates a handle, by which the index is identified in subsequent operations. The handle is an \fIindex identifier\fR, a structure of type IID. .(x z index identifier .)x \*($n The value of the index identifier can be treated as an opaque value by the application. .lp The following macros can be used to give an illegitimate initial value to an index identifier, and later to recognize that value: .(b I \fBINVALIDATE_IID (IID iid)\fR .)b sets \*(lqiid\*(rq to an invalid index identifier. .(b I \fBIID_IS_INVALID (IID iid)\fR .)b returns TRUE if \*(lqiid\*(rq has the value given by INVALIDATE_IID(\ ), FALSE if not. .lp The rest of this section describes the functions that .(x z index, operations on .)x \*($n .(x z index .)x \*($n operate on indexes. .sp .(b L \fBsm_CreateIndex(volume, groupIndex, ndxType, keyType, maxKeyLen, elSize, unique, ndx) VOLID volume; /* IN volume on which index is to be built */ int groupIndex; /* IN the buffer group to use */ SMTYPE ndxType; /* IN SM_BTREENDX, SM_HASHNDX, etc */ SMDATATYPE keyType; /* IN SM_int, SM_long, SM_string, etc */ int maxKeyLen; /* IN maximum key length of a key in the index */ int elSize; /* IN element size (mpl of 4, < SM_MAXELEMLEN) */ BOOL unique; /* IN TRUE if key is unique */ IID* ndx; /* OUT returned index identifier */ \fR .)b .(x z sm_CreateIndex(\ ) .)x \*($n .lp Sm_CreateIndex(\ ) creates an index that resides on \*(lqvolume\*(rq.
\** .(f \** Indexes on temporary volumes are not implemented (see Section 5.1.3, \fBTemporary Volumes\fR). If the volume given is temporary, sm_CreateIndex(\ ) returns esmFAILURE, with error code esmNOTIMPLEMENTED. .)f \*(lqNdxType\*(rq specifies the type of index (SM_BTREENDX for B\*[+\*]tree or SM_HASHNDX for linear hashing). \*(lqKeyType\*(rq indicates the data type of the key. The maximum length of a key in the index is given in \*(lqmaxKeyLen\*(rq. The size of the elements in the index is given in \*(lqelSize\*(rq. The element size must be a multiple of four and less than SM_MAXELEMLEN (20). If \*(lqunique\*(rq is FALSE, the index is able to store multiple elements under the same key. An index identifier is returned in \*(lqndx\*(rq upon successful completion. .sp .(b L \fBsm_DestroyIndex(ndx, groupIndex) IID* ndx; /* IN id of index to destroy */ int groupIndex; /* IN which buffer group to use */\fR .)b .(x z sm_DestroyIndex(\ ) .)x \*($n .lp Sm_DestroyIndex(\ ) destroys the index associated with \*(lqndx\*(rq. .sp .(b L \fBsm_SetLHashLoadThreshold(ndx, groupIndex, loadFactor) IID* ndx; /* IN index identifier */ int groupIndex; /* IN which buffer group to use */ float loadFactor; /* IN the load factor to use for linear hashing */ .)b .(x z sm_SetLHashLoadThreshold(\ ) .)x \*($n .lp Sm_SetLHashLoadThreshold(\ ) changes the load factor for a linear hashing index from the default 75% to the given \*(lqloadFactor\*(rq. .(x z load factor, default for linear hashing indexes .)x \*($n .(x z default load factor .)x \*($n The default load factor, 75%, yields the best access time and space utilization. See [Litw88] for information about linear hashing and when it might be useful to change the load factor. The load factor can be set only on a newly created index.
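.lp The string-key comparison rule described under \fBOperations on Indexes\fR (compare one character at a time; if the strings agree over the length of the shorter one, the longer string is the larger) can be sketched in C as follows. This is an illustration only, not the client library's code: SKETCH_KEY and sketch_compare_string_keys(\ ) are hypothetical names, and the length field is assumed here to be a short, whereas the real KEY structure uses the type TWO from \fCsm_client.h\fR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the documented KEY structure. */
typedef struct {
    short length;     /* number of bytes in the key */
    void *valuePtr;   /* key bytes; not NUL-terminated */
} SKETCH_KEY;

/*
 * Compare two string keys by the documented rule.
 * Returns a negative value, zero, or a positive value when a is
 * smaller than, equal to, or larger than b.
 */
int sketch_compare_string_keys(const SKETCH_KEY *a, const SKETCH_KEY *b)
{
    const unsigned char *pa = a->valuePtr;
    const unsigned char *pb = b->valuePtr;
    size_t common;
    size_t i;

    assert(a->length >= 0 && b->length >= 0);
    common = (size_t)(a->length < b->length ? a->length : b->length);

    /* Compare one character at a time over the shorter length. */
    for (i = 0; i < common; i++) {
        if (pa[i] != pb[i])
            return pa[i] < pb[i] ? -1 : 1;
    }

    /* Equal over the common prefix: the longer key is the larger. */
    if (a->length == b->length)
        return 0;
    return a->length < b->length ? -1 : 1;
}
```

Because the lengths take part in the comparison, a four-byte key holding "abc" plus a trailing zero byte compares larger than the three-byte key "abc", as the text states.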
.br .sh 3 "Inserting and Removing Index Elements " .sp .(b L \fBsm_InsertEntry(ndx, groupIndex, key, elem) IID* ndx; /* IN index identifier */ int groupIndex; /* IN which buffer group to use */ KEY* key; /* IN key to insert */ void* elem; /* IN element associated with key */ \fR .)b .(x z sm_InsertEntry(\ ) .)x \*($n .lp Sm_InsertEntry(\ ) inserts a <key, elem> pair into the index \*(lqndx\*(rq. If \*(lqndx\*(rq is a unique index and the key to be inserted already appears in the index, sm_InsertEntry(\ ) returns an error in sm_errno. If the index is not unique, there is no limit to the number of duplicate keys as long as different elements are associated with them. .sp .(b L \fBsm_RemoveEntry(ndx, groupIndex, key, elem) IID* ndx; /* IN index identifier */ int groupIndex; /* IN which buffer group to use */ KEY* key; /* IN key to remove */ void* elem; /* IN element associated with key */ \fR .)b .(x z sm_RemoveEntry(\ ) .)x \*($n .lp Sm_RemoveEntry(\ ) removes a <key, elem> pair from the index \*(lqndx\*(rq. .br .sh 3 "Loading Indexes in Bulk" .lp The Storage Manager provides a bulk-load facility for efficiently loading an empty index. When the application begins a bulk-load operation, the client library allocates a temporary run-buffer, which is used for sorting runs. Henceforth, the application uses sm_InsertEntry(\ ) repeatedly to load elements into the index; no other index operations are allowed during a bulk-load. Each sm_InsertEntry(\ ) operation for the index inserts a <key, elem> pair into the temporary run buffer. The run buffer is sorted and written to the work file as a \*(lqsorted-run\*(rq when it is full. When the application terminates the bulk-load operation, the client library merges the sorted-runs into a sorted stream, from which the index is built from the bottom up. .lp Entries cannot be removed during a bulk-load operation.
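.lp The run-buffer mechanism described above (fill a buffer, sort it, save it as a sorted run, then merge the runs into one sorted stream) can be sketched in C as follows. This is a simplified model under stated assumptions, not the Storage Manager's implementation: keys and elements are plain integers, the \*(lqwork file\*(rq is an in-memory array, and RUN_CAPACITY, MAX_RUNS, LOADSTATE, load_begin(\ ), load_insert(\ ), and load_end(\ ) are hypothetical names. The sketch stops at the sorted stream and does not build a tree from it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RUN_CAPACITY 4   /* hypothetical run-buffer size, in entries */
#define MAX_RUNS     16  /* hypothetical limit on the number of runs */

typedef struct { int key; int elem; } ENTRY;

typedef struct {
    ENTRY buf[RUN_CAPACITY];             /* temporary run buffer */
    int   used;                          /* entries now in the buffer */
    ENTRY runs[MAX_RUNS][RUN_CAPACITY];  /* stand-in for the work file */
    int   runLen[MAX_RUNS];
    int   numRuns;
} LOADSTATE;

static int by_key(const void *a, const void *b)
{
    const ENTRY *x = a, *y = b;
    return (x->key > y->key) - (x->key < y->key);
}

/* Sort the run buffer and save it as one sorted run. */
static void flush_run(LOADSTATE *ls)
{
    if (ls->used == 0)
        return;
    assert(ls->numRuns < MAX_RUNS);
    qsort(ls->buf, (size_t)ls->used, sizeof(ENTRY), by_key);
    memcpy(ls->runs[ls->numRuns], ls->buf, (size_t)ls->used * sizeof(ENTRY));
    ls->runLen[ls->numRuns++] = ls->used;
    ls->used = 0;
}

void load_begin(LOADSTATE *ls)
{
    ls->used = 0;
    ls->numRuns = 0;
}

/* Buffer one <key, elem> pair; a full buffer becomes a sorted run. */
void load_insert(LOADSTATE *ls, int key, int elem)
{
    ls->buf[ls->used].key = key;
    ls->buf[ls->used].elem = elem;
    if (++ls->used == RUN_CAPACITY)
        flush_run(ls);
}

/* Merge all runs into out[] in key order; returns the entry count. */
int load_end(LOADSTATE *ls, ENTRY *out)
{
    int pos[MAX_RUNS] = {0};
    int n = 0;

    flush_run(ls);                       /* save the final partial run */
    for (;;) {
        int best = -1, r;
        for (r = 0; r < ls->numRuns; r++) {
            if (pos[r] < ls->runLen[r] &&
                (best < 0 ||
                 ls->runs[r][pos[r]].key < ls->runs[best][pos[best]].key))
                best = r;
        }
        if (best < 0)
            break;                       /* every run is exhausted */
        out[n++] = ls->runs[best][pos[best]++];
    }
    return n;
}
```

A larger run buffer produces fewer runs to merge at the cost of more memory, which is the trade-off that the run size controls.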
.sp 2 .(b L \fBint sm_BeginIndexLoad(ndx, groupIndex, workVolume, runSize) IID* ndx; /* IN index identifier */ int groupIndex; /* IN the buffer group to use */ VOLID workVolume; /* IN work volume */ int runSize; /* IN size of each sorted run in pages */ \fR .)b Sm_BeginIndexLoad(\ ) prepares to load the index given in \*(lqndx\*(rq, using the buffer group \*(lqgroupIndex\*(rq. Sm_BeginIndexLoad(\ ) uses the volume named by \*(lqworkVolume\*(rq for the sorted runs. Using a temporary volume for the work volume yields .(x z temporary volume .)x \*($n the best performance (see Section 5.1.3, \fBTemporary Volumes\fR). .lp The \*(lqrunSize\*(rq argument determines how many MIN_PAGESIZE pages to fill before ending a run. The larger \*(lqrunSize\*(rq, the more memory is consumed by the bulk-load, with a commensurate improvement in speed. Sm_BeginIndexLoad(\ ), if it is used, must be the first operation performed on an index. .sp 2 .(b L \fBint sm_EndIndexLoad(ndx) IID* ndx; /* IN index identifier */ \fR .)b Sm_EndIndexLoad(\ ) concludes the bulk-load and builds the index. .sp .(b L 2 \fBint sm_AbortIndexLoad(ndx) IID* ndx; /* IN index identifier */ \fR .)b sm_AbortIndexLoad(\ ) aborts the bulk-loading of an index. All resources used by the index are freed. .br .sh 3 "Scanning Indexes" .lp Indexes are used by posing queries with the sm_FetchInit(\ ) operation. A query requests all the elements whose key values lie in a range. The results of the query are fetched, one element at a time, with the sm_FetchNext(\ ) operation. An index scan uses a \fIcursor\fR, a value of the type SMCURSOR. .(x z cursor .)x \*($n A cursor can be treated by the application as an opaque value. The following two macros give a cursor an invalid initial value and recognize that value: .(b I \fBINVALIDATE_CURSOR (SMCURSOR cursor)\fR .)b sets \*(lqcursor\*(rq to an invalid index scan cursor. 
.(b I \fBCURSOR_IS_INVALID (SMCURSOR cursor)\fR .)b returns TRUE if \*(lqcursor\*(rq is the value given by INVALIDATE_CURSOR(\ ), FALSE if not. .lp The rest of this section describes the functions used to scan indexes. .sp .(b L \fBsm_FetchInit(ndx, groupIndex, bound1, cond1, bound2, cond2, cursor) IID* ndx; /* IN index identifier */ int groupIndex; /* IN which buffer group to use */ KEY* bound1; /* IN starting bound of the scan */ SMCOND cond1; /* IN starting condition */ KEY* bound2; /* IN ending bound of the scan */ SMCOND cond2; /* IN ending condition */ SMCURSOR* cursor; /* OUT returned pointer if non-NULL */\fR .)b .(x z scanning an index .)x \*($n .(x z index scan .)x \*($n .(x z sm_FetchInit(\ ) .)x \*($n .lp Sm_FetchInit(\ ) begins a scan on the index \*(lqndx\*(rq. The arguments \*(lqbound1\*(rq and \*(lqcond1\*(rq specify the beginning search condition. \*(lqBound2\*(rq and \*(lqcond2\*(rq specify the ending search condition. The conditions can be SM_EQ, SM_G, SM_L, SM_GEQ, or SM_LEQ. The \*(lqcursor\*(rq argument is initialized by sm_FetchInit(\ ) and used by sm_FetchNext(\ ). The caller is responsible for allocating the space for the cursor and the client library is responsible for the value of the cursor. .sp The direction of the scan (ascending or descending) is determined by the bounds and conditions of the query. The beginning and end of an index are specified with the macros SM_BOF and SM_EOF. For linear hashing indexes (type SM_HASHNDX), the only value that .(x z index query .)x \*($n .(x z query, index .)x \*($n can be used for \*(lqcond1\*(rq and \*(lqcond2\*(rq is SM_EQ.
.sp Several examples of queries follow: .np Scan from key1 = \*(lq10\*(rq to key2 = \*(lq30\*(rq inclusively: .br sm_FetchInit( ..., key1, SM_GEQ, key2, SM_LEQ, cursor) --- ascending .br sm_FetchInit( ..., key2, SM_LEQ, key1, SM_GEQ, cursor) --- descending .sp .np Scan from key1 = \*(lq10\*(rq to the end of the index: .br sm_FetchInit( ..., key1, SM_GEQ, SM_EOF, cursor) --- ascending .br sm_FetchInit( ..., SM_EOF, key1, SM_GEQ, cursor) --- descending .sp .np Scan the whole index: .br sm_FetchInit( ..., SM_BOF, SM_EOF, cursor) --- ascending .br sm_FetchInit( ..., SM_EOF, SM_BOF, cursor) --- descending .lp .sp 2 .(b L \fBsm_FetchNext(cursor, retKey, retElem, eof) SMCURSOR* cursor; /* IN cursor from sm_FetchInit(\ ) */ KEY* retKey; /* OUT returned key (optional) */ void* retElem; /* OUT elem */ BOOL* eof; /* OUT set to TRUE if EOF reached */\fR .)b .(x z sm_FetchNext(\ ) .)x \*($n .lp Sm_FetchNext(\ ) fetches the next element returned by a query. The element is returned in the structure addressed by \*(lqretElem\*(rq. A copy of the key can also be returned to the caller. If \*(lqretKey\*(rq is NULL, no key is returned. If \*(lqretKey\*(rq points to a key structure, the key is returned in that structure. The \*(lqlength\*(rq field in the key structure must indicate the amount of space available in the target of the \*(lqvaluePtr\*(rq field. This must be enough for the longest key in the index. The caller is responsible for allocating space for \*(lqretKey\*(rq and \*(lqretElem\*(rq. .lp sm_FetchNext(\ ) returns FALSE in \*(lqeof\*(rq if an element is returned. If there are no more elements that satisfy the query, TRUE is returned in \*(lqeof\*(rq.
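.lp The calling pattern for an index scan (the caller allocates the cursor, the key buffer, and the element buffer, then fetches until \*(lqeof\*(rq is TRUE) can be sketched in C as follows. So that the sketch compiles without the Storage Manager headers, it uses stand-in types and two mock functions, mock_FetchInit(\ ) and mock_FetchNext(\ ), over a small in-memory table. The mocks are assumptions made for illustration: mock_FetchInit(\ ) takes plain integer bounds rather than the documented arguments, the return codes 0 and -1 are invented, and whether the real sm_FetchNext(\ ) updates the \*(lqlength\*(rq field of \*(lqretKey\*(rq is not stated in this document.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for types that the real headers would provide. */
typedef int BOOL;
#define TRUE  1
#define FALSE 0
typedef struct { short length; void *valuePtr; } KEY;  /* TWO assumed short */
typedef struct { int next; int last; } SMCURSOR;       /* really opaque */

static const int mockKeys[]  = { 10, 20, 30, 40 };
static const int mockElems[] = { 100, 200, 300, 400 };
#define MOCK_COUNT 4

/* Mock: position an ascending scan over keys in [lo, hi]. */
static void mock_FetchInit(int lo, int hi, SMCURSOR *cursor)
{
    int i = 0, j = MOCK_COUNT - 1;
    while (i < MOCK_COUNT && mockKeys[i] < lo) i++;
    while (j >= 0 && mockKeys[j] > hi) j--;
    cursor->next = i;
    cursor->last = j;
}

/* Mock with the documented sm_FetchNext( ) argument shape. */
static int mock_FetchNext(SMCURSOR *cursor, KEY *retKey, void *retElem,
                          BOOL *eof)
{
    if (cursor->next > cursor->last) {
        *eof = TRUE;                    /* no more matching elements */
        return 0;
    }
    if (retKey != NULL) {
        if (retKey->length < (short)sizeof(int))
            return -1;                  /* caller's key buffer too small */
        memcpy(retKey->valuePtr, &mockKeys[cursor->next], sizeof(int));
        retKey->length = (short)sizeof(int);   /* assumed behavior */
    }
    memcpy(retElem, &mockElems[cursor->next], sizeof(int));
    cursor->next++;
    *eof = FALSE;
    return 0;
}

/* Sum the elements whose keys lie in [lo, hi]; -1 on error. */
int sum_elems_in_range(int lo, int hi)
{
    SMCURSOR cursor;                    /* caller allocates the cursor */
    KEY      key;
    int      keyBuf, elem, sum = 0;
    BOOL     eof = FALSE;

    mock_FetchInit(lo, hi, &cursor);
    for (;;) {
        key.length = (short)sizeof(keyBuf);  /* space at valuePtr */
        key.valuePtr = &keyBuf;
        if (mock_FetchNext(&cursor, &key, &elem, &eof) != 0)
            return -1;
        if (eof)
            break;
        sum += elem;
    }
    return sum;
}
```

Resetting the \*(lqlength\*(rq field before every fetch keeps the loop correct whether or not the fetch overwrites it.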
.sp .br .sh 2 "Advanced Topics" .sh 3 "External Two-Phase Commit Functions" .(x z two-phase commit functions, external .)x \*($n .(x z two-phase commit protocol, external .)x \*($n .(x z transactions, distributed .)x \*($n .(x z distributed transactions .)x \*($n .lp The Storage Manager can participate in transactions coordinated by other software modules that employ the two-phase commit \*(lqpresumed abort\*(rq transaction semantics and protocol. (For the purpose of this section, the reader is assumed to be familiar with the \*(lqpresumed abort\*(rq protocol.) The coordinator in such a situation is external to the Storage Manager; it is assumed to have its own stable storage, and it is assumed to recover from failures in a \fIshort time\fR (the precise meaning of which is given forthwith). .lp A prepared transaction, like an active transaction, consumes log space on one or more Exodus servers, beginning at a fixed location in each log. A Storage Manager server's log is like a circular buffer; it wraps and reuses the beginning of the log. If long-running or prepared transactions are still in the system, the server eventually tries to re-use log space consumed by the oldest transaction, at which point it effectively runs out of log space. A coordinator must resolve its prepared transactions before the servers run out of log space. The amount of time involved is a function of the size of the log on the participating servers and the load on those servers. .lp For the purpose of this discussion, the portion of a global transaction that involves a single Exodus Storage Manager transaction is called a \fIthread\fR .(x z thread .)x \*($n of the global transaction. Each thread has, in addition to its local transaction identifier, a global transaction identifier. Global transaction identifiers are provided by the application or some external authority, and must be unique.
A global transaction identifier has type GTID, defined in \fCsm_client.h\fR, as follows: .(b I \fB#define MAXOPAQUELEN 255 \fBtypedef struct { int length; /* maximum MAXOPAQUELEN bytes */ u_char opaque[MAXOPAQUELEN]; } GTID; .)b .(x z transaction identifier, global .)x \*($n .(x z global transaction identifier .)x \*($n .(x z GTID .)x \*($n .lp The Storage Manager does not interpret the contents of the opaque part of the global transaction identifier. .lp An application that invokes the external two-phase commit protocol can find itself in any of the transaction states mentioned in Section 4.3.2 (\*(lqTransaction States\*(rq). It can also find itself in the PREPARED state after a call to sm_PrepareTransaction(\ ). An application in PREPARED state calls sm_CommitTransaction(\ ) or sm_AbortTransaction(\ ) to complete the transaction and return to the INACTIVE state. .lp While the coordinator for a global transaction is external to the Storage Manager, a single Storage Manager server corresponds with the client library and coordinates the Storage Manager servers that participate in the thread. If the application should crash during a two-phase commit, a new application program (representing the global coordinator) must run, and it must contact the Storage Manager that is acting as the thread's coordinator. In order to locate the proper server, a two-phase commit process begins by informing the client library that a transaction is a thread of a global transaction, and by identifying the thread's coordinator. The function sm_Enter2PC(\ ), described below, accomplishes this. .sp .(b L \fBsm_Enter2PC (tid, gtid, handle) TID tid; /* IN transaction ID */ GTID *gtid; /* IN global transaction ID */ COORD_HANDLE *handle; /* OUT for use if client crashes */ \fR .)b .(x z sm_Enter2PC(\ ) .)x \*($n .lp The application supplies the local and global transaction identifiers. 
The client library identifies a thread coordinator, and produces a handle for the application to write to stable storage. The handle identifies the thread coordinator; it is used by sm_Recover2PC(\ ) if the client crashes before the two-phase commit is completed. .lp The handle must be written to stable storage before the first phase of the commit begins, otherwise the application and Storage Manager may not be able to recover from a subsequent application failure. .sp .(b L \fBsm_PrepareTransaction (tid, vote) TID tid; /* IN transaction ID */ VOTE *vote; /* OUT result of first phase */ \fR .)b .(x z sm_PrepareTransaction(\ ) .)x \*($n .lp The application calls sm_PrepareTransaction(\ ) to begin the first, or prepare, phase of a two-phase commit. sm_PrepareTransaction(\ ) determines if the participating servers are able to commit the transaction, and directs them to prepare to commit if they are. If any of the participating servers is unable to commit the transaction, the vote returned is NOVOTE; sm_PrepareTransaction(\ ) sets sm_errno to esmTRANSABORTED and sm_reason to esmTRANSNOTPREPARED, and returns esmFAILURE. The application must then call sm_AbortTransaction(\ ). .lp If all participating servers are able to commit, and any of them logged updates during the transaction, the vote is YESVOTE, and the transaction state becomes PREPARED. If the transaction did not update any data on any of the servers, the vote is READVOTE, and the transaction state becomes INACTIVE. Sm_PrepareTransaction(\ ) returns esmNOERROR if the transaction becomes prepared (all servers vote YESVOTE) or committed (all servers vote READVOTE). .lp If an error occurs during the prepare phase, sm_PrepareTransaction(\ ) returns esmFAILURE. If it is a recoverable error, the client library returns an error code specific to the error in sm_errno (such as esmTRANSDISABLED if a server is performing recovery), and the application can try again to call sm_PrepareTransaction(\ ).
Some errors, on the other hand, cause the transaction to be aborted, in which case sm_PrepareTransaction(\ ) returns esmTRANSABORTED in sm_errno, and a vote of NOVOTE. .(x z vote, two-phase commit .)x \*($n .(x z esmTRANSABORTED .)x \*($n .lp If an application crashes during the first phase, the application must retry the prepare phase and complete the transaction. If it does not retry the prepare phase, and the transaction was indeed prepared before the application crashed, the prepared transaction consumes resources indefinitely, and eventually its servers will run out of log space. .lp Once a transaction is prepared, an application must invoke the second phase by aborting or committing the transaction (calling sm_AbortTransaction(\ ) or sm_CommitTransaction(\ ), respectively). It is an error to commit a global transaction thread without first preparing the transaction, and it is an error to do anything else without completing the second phase. .lp When an error occurs during the second phase, the application cannot tell if the second phase completed (the transaction indeed committed or aborted). It is always safe to try again to complete the transaction by calling sm_AbortTransaction(\ ) or sm_CommitTransaction(\ ) again. .lp If the second phase fails because the network connection between the client and the thread coordinator breaks (esmSERVERDIED or esmNOTCONNECTED), the client must reconnect to the thread coordinator before the second phase can be finished. The following function does that: .sp .(b L \fBsm_Continue2PC (tid, willing2block) TID tid; /* IN transaction ID */ BOOL willing2block; /* IN ok to block indefinitely */ .)b .(x z sm_Continue2PC(\ ) .)x \*($n .lp If \*(lqwilling2block\*(rq is TRUE, the client library blocks until it connects to the thread coordinator. If this is inappropriate for the application, \*(lqwilling2block\*(rq must be FALSE, and the client library tries once to contact the thread coordinator.
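.lp The voting rule described above for sm_PrepareTransaction(\ ) can be sketched in C as follows: NOVOTE if any participating server cannot commit, YESVOTE if all can commit and at least one logged updates, and READVOTE if none logged updates. This illustrates the rule only; it is not how the client library or the servers compute the vote. The numeric values of the VOTE enumeration, the SERVER_ANSWER structure, and combine_votes(\ ) are hypothetical.

```c
#include <assert.h>

/* Documented vote names; the numeric values here are hypothetical. */
typedef enum { NOVOTE, YESVOTE, READVOTE } VOTE;

/* Hypothetical summary of what one participating server reports. */
typedef struct {
    int canCommit;      /* nonzero if the server is able to commit */
    int loggedUpdates;  /* nonzero if it logged updates in the transaction */
} SERVER_ANSWER;

/* Combine the answers of all participating servers into one vote. */
VOTE combine_votes(const SERVER_ANSWER *servers, int numServers)
{
    int i;
    int anyUpdates = 0;

    assert(numServers > 0);
    for (i = 0; i < numServers; i++) {
        if (!servers[i].canCommit)
            return NOVOTE;          /* one refusal aborts the transaction */
        if (servers[i].loggedUpdates)
            anyUpdates = 1;
    }
    /* All can commit: PREPARED if anything was updated, else done. */
    return anyUpdates ? YESVOTE : READVOTE;
}
```

After YESVOTE the transaction is PREPARED and the application still owes the second phase; after READVOTE the transaction is already INACTIVE; after NOVOTE the application must call sm_AbortTransaction(\ ).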
.lp If the application crashes, its replacement must use sm_Recover2PC(\ ), below, instead of sm_Continue2PC(\ ) to resolve the transaction. .sp .(b L \fBsm_Recover2PC (gtid, handle, willing2block, tid) COORD_HANDLE *handle; /* IN handle for thread coordinator */ GTID *gtid; /* IN global transaction ID */ BOOL willing2block; /* IN ok to block indefinitely */ TID *tid; /* OUT local transaction ID */\fR .)b .(x z sm_Recover2PC(\ ) .)x \*($n .lp When the application crashes (exits) after a transaction is prepared but before its second phase is completed, a \*(lqrecovery\*(rq application program must be run within a short time to finish the two-phase commit and resolve the transaction. This recovery application must use sm_Recover2PC(\ ), supplying the global transaction identifier and the handle returned by sm_Enter2PC(\ ) for that global transaction. The client library contacts the server identified in the handle, which conveys to the client library all that is needed for the application to enter or to retry the second phase. The transaction's local transaction identifier is returned by sm_Recover2PC(\ ) for the application to use in its subsequent call to sm_CommitTransaction(\ ) or sm_AbortTransaction(\ ). .lp The thread coordinator may not be available, in which case the client library keeps trying to connect or it will return an error (such as ECONNREFUSED), depending on the value of \*(lqwilling2block\*(rq. If \*(lqwilling2block\*(rq is FALSE, the client library tries only once to connect to the thread coordinator. .br .sh 3 "Administrative Operations" .lp The following functions can be applied to one or more servers. Each function takes two arguments that determine which servers are of interest.
The first argument is of type FLAGS, and takes one of the following values: .sp .(b \fRVOL_ALL /* the servers for all volumes */ VOL_USED_SINCE_INIT /* servers for all volumes used */ VOL_USED_IN_TRANSACTION /* servers used in this transaction */ VOL_BY_VOLID /* the second argument applies */ .)b The client library keeps a list of volumes and the servers that manage those volumes. The list is created from the information given in the configuration files and information passed to the library .(x z configuration files .)x \*($n through sm_SetClientOption(\ ). The flag VOL_ALL directs the client library to apply the administrative operation to the server that manages each volume in its list of known volumes. The flag VOL_USED_SINCE_INIT directs the client library to apply the administrative operation to each server contacted since sm_Initialize(\ ) was called. The flag VOL_USED_IN_TRANSACTION directs the client library to apply the administrative operation to each server contacted so far for participation in the current transaction. (It does not apply to servers to be contacted for the first time later in the transaction.) The flag VOL_BY_VOLID directs the client library to apply the administrative operation to the server that manages the volume identified by the second argument. The second argument is a volume identifier VOLID, which is ignored when the flags argument is VOL_ALL, VOL_USED_SINCE_INIT, or VOL_USED_IN_TRANSACTION. .lp Ideally the administrative operations would only be performed by trusted clients, but the Storage Manager does not restrict their use. .sp .(b L \fBsm_TakeCheckpoint (flags, volid, numCheckpoints) FLAGS flags; /* IN which servers are of interest */ VOLID volid; /* IN which server is of interest */ short numCheckpoints; /* IN number of checkpoints to take */\fR .)b .(x z sm_TakeCheckpoint(\ ) .)x \*($n .lp Sm_TakeCheckpoint(\ ) sends a request to the server to take a number of checkpoints.
In most circumstances, a value of one for the \*(lqnumCheckpoints\*(rq argument is appropriate. A value greater than 1 can be used to ensure that the server flushes all pages that were dirty when the first checkpoint was taken. (This is useful for experimenting with the recovery facility). .sp .(b L \fBsm_ChangeCheckpointFrequency (flags, volid, frequency) FLAGS flags; /* IN which servers are of interest */ VOLID volid; /* IN which server is of interest */ int frequency; /* IN number of log records between checkpoints */\fR .)b .(x z sm_ChangeCheckpointFrequency(\ ) .)x \*($n .lp Sm_ChangeCheckpointFrequency(\ ) changes the frequency of checkpoints taken by the server. The checkpoint frequency is based on the number of log pages written. .(x z checkpoint frequency, changing .)x \*($n .(x z default checkpoint frequency .)x \*($n More information about checkpoint frequency can be found in Section 5.3, \fBTuning the Server\fR. .sp .(b L \fBsm_ShutdownServer (flags, volid, options) FLAGS flags; /* IN which servers are of interest */ VOLID volid; /* IN which server is of interest */ FLAGS options; /* IN shutdown options */\fR .)b .(x z sm_ShutdownServer(\ ) .)x \*($n .lp Sm_ShutdownServer(\ ) directs servers to shut down. The \*(lqoptions\*(rq argument indicates what a server should do before exiting. The following flags are available: NOFLAGS, SHUT_TAKE_CHECKPOINT, SHUT_DUMP_CORE, SHUT_ABORT_TRANS, SHUT_COMMIT_TRANS, SHUT_CLEAN_VOLUMES. These can be combined with the logical \*(lqor\*(rq operator. .lp If NOFLAGS is given, the server kills the disk processes and exits. .lp SHUT_TAKE_CHECKPOINT directs the server to take a checkpoint before exiting. .lp SHUT_DUMP_CORE directs the server to dump a core file for debugging (see core(5)). .lp SHUT_COMMIT_TRANS directs the server to wait until the running transactions either commit or abort before it shuts down. .lp SHUT_ABORT_TRANS directs the server to abort all running transactions before shutting down.
When SHUT_COMMIT_TRANS or SHUT_ABORT_TRANS is used, clients cannot start any new transactions. .lp SHUT_CLEAN_VOLUMES directs the server to write dirty pages to disk before exiting. To shut down a server after which recovery is not required, use either SHUT_COMMIT_TRANS | SHUT_CLEAN_VOLUMES or SHUT_ABORT_TRANS | SHUT_CLEAN_VOLUMES. .sp .(b L \fBsm_ServerStatistics (flags, volid, numServers, stats, reset) FLAGS flags; /* IN which servers are of interest */ VOLID volid; /* IN which server is of interest */ int *numServers; /* OUT # servers contacted */ SERVERSTATS **stats; /* OUT servers' statistics */ BOOL reset; /* IN TRUE = reinitialize counters */\fR .)b .(x z sm_ServerStatistics(\ ) .)x \*($n .lp Sm_ServerStatistics(\ ) obtains statistics about one or more servers. For each server contacted, a set of statistics is returned. The client library allocates space for the statistics, and the \fBapplication is responsible for freeing that space\fR ( see the manual page for malloc(3) ). The \*(lqflags\*(rq indicate which servers are of interest, and the number of servers contacted is returned in \*(lq*numServers\*(rq. On return from sm_ServerStatistics(\ ), the \*(lq*stats\*(rq pointer addresses an array of \*(lq*numServers\*(rq SERVERSTATS structures. This array must be freed by the application with one call to \fIfree(3)\fR. .lp If \*(lqreset\*(rq is TRUE, the statistics labeled as counters below are reset to zero. 
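.lp The memory-ownership contract of sm_ServerStatistics(\ ) described above (the library allocates one array of statistics, and the application releases it with a single call to free) can be sketched in C as follows. The sketch does not call the Storage Manager. SKETCH_STATS is a reduced stand-in for SERVERSTATS, and mock_ServerStatistics(\ ) is a hypothetical mock that fabricates three entries and omits the \*(lqflags\*(rq, \*(lqvolid\*(rq, and \*(lqreset\*(rq arguments; its return codes 0 and -1 are also assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/* Reduced stand-in for SERVERSTATS: two of the documented fields. */
typedef struct {
    int numClients;   /* # clients connected */
    int numTrans;     /* # transactions in progress */
} SKETCH_STATS;

/*
 * Mock with the documented ownership contract: on success, *stats
 * addresses one malloc'ed array of *numServers entries, which the
 * caller must release with a single free().
 */
static int mock_ServerStatistics(int *numServers, SKETCH_STATS **stats)
{
    int i;
    int n = 3;   /* made-up number of servers */
    SKETCH_STATS *arr = malloc((size_t)n * sizeof *arr);

    if (arr == NULL)
        return -1;
    for (i = 0; i < n; i++) {
        arr[i].numClients = i + 1;   /* made-up values */
        arr[i].numTrans = 2 * i;
    }
    *numServers = n;
    *stats = arr;
    return 0;
}

/* Total transactions in progress on all servers contacted; -1 on error. */
int total_transactions(void)
{
    int numServers = 0;
    int i, total = 0;
    SKETCH_STATS *stats = NULL;

    if (mock_ServerStatistics(&numServers, &stats) != 0)
        return -1;
    for (i = 0; i < numServers; i++)
        total += stats[i].numTrans;
    free(stats);   /* one call releases the whole array */
    return total;
}
```

The entries are elements of one allocation, so they must not be freed individually.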
.lp The SERVERSTATS structure looks like this: .(b I \fBtypedef struct { int numClients; /* # clients connected */ int numTrans; /* # transactions in progress */ int numVolumes; /* # volumes mounted */ int freeLogSpace; /* approximate # bytes free log space */ int chpntFreq; /* checkpoint frequency */ int totalCommits; /* # transactions committed */ int totalAborts; /* # transactions aborted */ int diskReads; /* # disk reads */ int diskWrites; /* # disk writes */ MESSAGESTATS msgStats; /* server's message counters */ } SERVERSTATS; .)b .(x z MESSAGESTATS .)x \*($n .(x z SERVERSTATS .)x \*($n .lp The MESSAGESTATS structure contains statistics about the client-server protocol and the server-server protocol. A set of these statistics is kept by the client library, and a set is kept by each server. The client library's statistics are found in the global structure .(b L \fBextern MESSAGESTATS MsgStats;\fR .)b The MESSAGESTATS structure contains the following counters for each message type: messages sent, messages received, replies received with an error indication, replies received with no error, messages sent with no reply requested. The counters for replies have two different meanings, depending on which set of statistics is concerned. The servers count the replies \fIsent\fR with and without error indications, and the number of requests that the server \fIreceived\fR that did not require a reply at all. The client library counts the replies \fIreceived\fR with and without error indications, and the number of requests that the client \fIsent\fR that did not require a reply at all. .lp The following function prints the MESSAGESTATS structure: .sp .(b L \fBsm_PrintMessageStats (file, msgStats) FILE *const file; /* IN where to print */ MESSAGESTATS *const msgStats; /* IN what to print */ .)b .(x z sm_PrintMessageStats(\ ) .)x \*($n .lp The following function tells if a mounted volume is a temporary volume, a data volume, or a log volume.
.(x z temporary volume .)x \*($n See Section 5.1, \fBManaging Volumes\fR, for information about volumes. .sp .(b L \fBsm_VolumeProperties (volid, properties) VOLID volid; /* IN which volume is of interest */ int *properties; /* OUT the properties */ .)b .(x z sm_VolumeProperties(\ ) .)x \*($n .lp Sm_VolumeProperties(\ ) returns a set of bits that tell whether the given volume is a data volume or a temporary volume. The \*(lqvolid\*(rq argument is the volume identifier of the volume in question. If the volume is not mounted when sm_VolumeProperties(\ ) is called, the function mounts it. .lp VOLPROP_TEMP indicates that the volume is temporary .(x z temporary volume .)x \*($n (see Section 5.1.3, \fBTemporary Volumes\fR). If the bit VOLPROP_TEMP is not set in the result, the volume is a data volume. A log volume cannot be mounted by a client, and an attempt to get a log volume's properties results in an error. .sp .(b L \fBsm_AddServerVolume (flags, volid, option, value) FLAGS flags; /* IN which servers are of interest */ VOLID volid; /* IN which volume is of interest */ char *option; /* IN which format option to use */ char *value; /* IN value for the format option */ .)b .(x z sm_AddServerVolume(\ ) .)x \*($n .lp Sm_AddServerVolume(\ ) adds a volume to the list of mountable volumes on one or more servers (although it seldom makes sense to do this on more than one server with a single pair of arguments). The \*(lqflags\*(rq argument indicates which servers are of interest. The \*(lqvolid\*(rq argument is the volume identifier of the volume that will determine which server to contact when \*(lqflags\*(rq == VOL_BY_VOLID. The \*(lqoption\*(rq is one of the server's format options (\*(lqdataformat\*(rq or \*(lqtempformat\*(rq). The \*(lqvalue\*(rq argument is the value to be given the option named in \*(lqoption\*(rq.
.lp Sm_AddServerVolume(\ ) adds the named volume to the server's list of known volumes, but the server does not try to mount the volume or verify that the volume exists or is valid. Sm_AddServerVolume(\ ) fails if the value given conflicts with another volume already in the server's table, either in the path name or the volume identifier. If your objective is to change the format information for a path name that is in the server's table, first remove the existing format information (using sm_RemoveServerVolume(\ ), described below), and subsequently add the new information. .sp .(b L \fBsm_RemoveServerVolume (flags, volid, volid2remove) FLAGS flags; /* IN which servers are of interest */ VOLID volid; /* IN which volume id of server of interest */ VOLID volid2remove; /* IN which volume to remove */ .)b .(x z sm_RemoveServerVolume(\ ) .)x \*($n .lp Sm_RemoveServerVolume(\ ) removes \*(lqvolid2remove\*(rq from one or more servers' lists of mountable volumes. The volume cannot be removed from a server's table while the volume is in use; it must be dismounted before it is removed. .lp See also Section 5.1, \fBManaging Volumes\fR. .br .sh 3 "Tuning the Application" .lp The size of the application's buffer pool, determined by the \*(lqbufpages\*(rq option, is the primary tuning parameter that is under the control of applications. The \*(lqbufpages\*(rq option indicates the number of MIN_PAGESIZE pages in the buffer pool. It should be set large enough to hold the application's working set of objects. The buffer pool must not exceed the size of physical memory available to the client. .bp .sh 1 "USING STORAGE MANAGER SERVERS" .lp Storage Manager servers provide disk, file, transaction, concurrency control, and recovery services to clients. In most respects, users do not have to understand how servers work, but there are a few things that administrators should know; we focus on those things in this section. The first half of this section explains how to manage volumes.
The second half explains how to operate a server. .sp .sh 2 "Managing Volumes" .lp Servers store data on \fIvolumes\fR, .(x z volume .)x \*($n .(x z files, Unix .)x \*($n .(x z partition .)x \*($n which can be Unix files or raw disk partitions. Each server is composed of a server process and one \fIdisk process\fR for each mounted volume. .(x z disk process .)x \*($n When a server requires I/O, it asks the appropriate disk process to read from or write to the server's buffer pool, which is located in a Unix System V shared-memory segment. The disk processes perform I/O so that the server never blocks when I/O is required. The server mounts a volume before using it, and the server dismounts the volume when it is no longer in use. Mounting a volume consists in forking a disk process for that volume. Dismounting the volume consists in flushing all dirty pages to the disk and killing the volume's disk process. .lp Volumes are created with the \fCformatvol\fR program, which establishes a volume's identifier, size, type, and other characteristics. Volumes come in three types: log volumes, data volumes, and temporary volumes. .(x z temporary volume .)x \*($n .sh 3 "Log Volumes" .lp Log volumes are used to store log information for aborting transactions and for recovery. The server has one log volume mounted at all times. .sh 3 "Data Volumes" .lp Data volumes are used to store objects and indexes that are meant to exist after a transaction ends. Changes to data volumes are logged so that transactions can be aborted or committed with reliability, and so that recovery can be performed after a crash. .sh 3 "Temporary Volumes" .lp Some applications store temporary private data and do not need concurrency control or recovery. The Storage Manager provides temporary volumes for this purpose. .(x z temporary volume .)x \*($n Locks are not acquired for data in temporary volumes, and updates to temporary volumes are not logged. 
Temporary volumes are less costly to use than data volumes are, but the data on them cannot be shared among transactions. The data on temporary volumes are deleted at the conclusion of the transaction that creates them, regardless of whether the transaction is committed or aborted. Temporary volumes cannot contain root entries. .lp The server can serve many data volumes and temporary volumes simultaneously. .sh 3 "Raw Partitions and Unix Files" .lp A volume can be a Unix file or a Unix raw partition. When a raw partition is used, data are transferred between the server's buffer pool and the disk by the disk process, bypassing the Unix file system's buffer pool. .lp When a Unix file is used, the data are written to the Unix file system's buffer pool, and the operating system worries about flushing the data to the disk. In this case, the server forces the data to the disk periodically with a Unix \fIfsync(\ )\fR system call. .br .sh 3 "Formatting Volumes" .lp Before a volume can be used, it must be formatted. This is done using the \fCformatvol\fR program, which can also display information about previously formatted volumes. Formatvol uses .(x z configuration options .)x \*($n the configuration options \*(lqdataformat\*(rq, \*(lqtempformat\*(rq, and \*(lqlogformat\*(rq to determine what characteristics to give volumes that it formats. The options have values that list the following information: .ip "path" 10 The Unix path name of the volume, e.g., \fC/dev/rz2c\fR. .ip "volid" 10 The volume identifier for this volume, an integer, e.g., 8000. .ip "#cyl" 10 The number of cylinders on this disk, e.g., 1224 for a DEC RZ55. May be 1. .ip "#trk/cyl" 10 The number of tracks per cylinder e.g., 15 for a DEC RZ55. May be 1. .ip "#sect/trk" 10 The number of sectors or blocks per track e.g., 36 for a DEC RZ55. May be the number of \fIblocks\fR in the file. .(x z block in a file .)x \*($n A block is MIN_PAGESIZE bytes; MIN_PAGESIZE is defined in \fCsm_client.h\fR. 
(This is determined by the Storage Manager, not by the device.) \** .(f \** The format of a volume does not affect performance with most modern disks. The easiest way to format volumes is to use 1 cyl, 1 track/cyl, and let the sect/trk account for the size of the entire volume. .)f .ip "#KB/pg" 10 \fBFor logformat only\fR. This gives the page size for log pages, in kilobytes. The value given here may be 4 or larger, and must be a power of 2. .lp Formatvol collects the format information from the options in the configuration files, after which it determines which volumes to format or to display by processing the options \*(lqvolume\*(rq and \*(lqdisplay\*(rq from the command line. The options that formatvol understands are summarized in Table 2. .(b .TS box, center, tab(;); c|c|c c|c|c l|l|l. Option;Option;Option Name;Type;Description _ tempformat;string,int,int,int,int;path,volid,#cyl,#trk/cyl,#sect/trk dataformat;string,int,int,int,int;path,volid,#cyl,#trk/cyl,#sect/trk logformat;string,int,int,int,int,int;path,volid,#cyl,#trk/cyl,#sect/trk,#KB/pg volume;int;volume to format - command line only display;int;volume to display - command line only .TE .ce 2 \fBTable 2: Formatvol Options.\fR Fields are separated by white space, commas, colons or semicolons. .)b .(x z options, formatvol .)x \*($n .sp For example, to print information about the volumes with volids 8000 and 4000, use: .(b I \fCformatvol -dis 8000 -dis 4000\fR .)b .lp To format a data volume with volid 8000 and a temporary volume with volid 4000, use: .(x z temporary volume .)x \*($n .(b I \fCformatvol -vol 8000 -vol 4000\fR .)b .lp Formatting a volume writes a volume header and initializes the bitmaps that describe the free blocks on the volume. A volume that is reformatted after being used loses all its data. .lp The Storage Manager does not prevent a volume from being formatted while it is in use by a server, even though \fBit will cause the server to crash unrecoverably\fR. 
Be certain that a volume is not mounted before you format it! \** .(f \** The Storage Manager ought to lock volumes with Unix file locks, but Unix does not provide an adequate mechanism for locking and unlocking files in the context of crash recovery. .)f A volume is unmounted when all clients that are using the volume have completed transactions on it and have unmounted it. (A client may unmount a volume explicitly with sm_DismountVolume(\ ), or by shutting down with sm_ShutDown(\ ) or \fIexit(\ )\fR.) .lp During recovery, a server mounts the volumes that need recovery. The volumes are dismounted when recovery is completed. If a volume was in use at the time its server crashed, \fBdo not reformat the volume until a new server recovers the data on that volume\fR. If you do, the server's log will be inconsistent with the data on the volume, and the server will crash during recovery, and it will be unable to recover from that. You can reformat the data volumes and the log volume to get a server running again, but you will have lost all data on the volumes. .lp The log volume is mounted whenever the server is running, so a log volume can be formatted ONLY when the server is not running. .br .sh 3 "Size Requirements for Log Volumes" .lp How large should a log volume be? .(x z log volume, size of .)x \*($n .(x z log space .)x \*($n The answer depends on the expected transaction mix. More specifically, it depends on the age of the oldest (longest running) transaction in the system and the amount of log space used by all active transactions. Here are some general rules to determine the amount of free log space available in the system. .np The physical log is circular. Log space between the first log record generated by the oldest active transaction and the most recent log record generated by any transaction cannot be reused. .np Log space for a transaction is available for reuse when the transaction has committed or completely aborted. 
Aborting a transaction causes log space to be used, so space is \fIreserved\fR for aborting each transaction. Enough log space must be available to commit \fIor abort\fR all active transactions at all times. .np Only space starting at the \fIbeginning\fR of the log can be reused. This space can be reused if it contains log records only for transactions meeting rule 2. .np All sm_WriteObject(\ ) calls require log space twice the size of the space written in the object. All calls that create, grow, or shrink objects require log space equal to the size created, inserted, or deleted. Log records generated by these calls (generally one per call) have an overhead of approximately 50 bytes. .np File operations are logged, but the space requirements for them are most often negligible, since they are relatively rare operations, and are often performed in short transactions. .np The amount of log space \fIreserved\fR for aborting a transaction is equal to the amount of log space generated by the transaction (for the purpose of committing the transaction). .np When insufficient log space is available for a transaction, the transaction is aborted. .np The log should be at least 1 Mbyte (250 pages). .lp For example, consider a transaction T1, which creates 300 objects of size 2,000 bytes, writes 20 bytes in 100 objects, and is committed. T1 requires 615 Kbytes of log space for the creates and 9 Kbytes of log space for the writes. Since log space must be reserved to abort the transaction, the log size must be over 1.248 Mbytes to run this transaction. Assuming T1 is the only transaction running in the system, all the log space it uses and reserves becomes available when it completes. If another transaction, T2, is started at the same time as T1, but is still running after T1 is committed, only the reserved space for T1 is available for other transactions. The portion of the log used by T1 and T2 is not available until T2 is finished. 
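.lp
The arithmetic in the T1 example can be reproduced with a short C sketch. The function names are hypothetical and are not part of the Storage Manager interface; the 50-byte figure is the approximate per-record overhead quoted in the rules above, so the results are estimates rather than exact log usage.

```c
#include <assert.h>

/* Approximate overhead of one log record, per the rules above. */
#define LOG_RECORD_OVERHEAD 50L

/*
 * Log bytes generated by `count` calls that each create, grow, or
 * shrink an object by `size` bytes (one log record per call assumed).
 */
long log_bytes_for_creates(long count, long size)
{
    assert(count >= 0 && size >= 0);
    return count * (size + LOG_RECORD_OVERHEAD);
}

/*
 * Log bytes generated by `count` sm_WriteObject() calls that each
 * write `size` bytes; a write logs twice the bytes written.
 */
long log_bytes_for_writes(long count, long size)
{
    assert(count >= 0 && size >= 0);
    return count * (2 * size + LOG_RECORD_OVERHEAD);
}

/*
 * Log space needed to run a transaction: the space it generates plus
 * an equal amount reserved for aborting it.
 */
long log_bytes_required(long generated)
{
    assert(generated >= 0);
    return 2 * generated;
}
```

.lp
For T1, log_bytes_for_creates(300, 2000) gives 615,000 bytes, log_bytes_for_writes(100, 20) gives 9,000 bytes, and log_bytes_required applied to their sum gives 1,248,000 bytes, which matches the 1.248 Mbyte figure above.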
.lp Transactions that fail because of insufficient log space are commonly those that load a large number of objects into a file during the creation of a database. A solution to this problem is to load the file in a series of smaller transactions. When the last transaction is committed, the load is complete. If the load needs to be aborted, a separate transaction is run to destroy the file. .br .sh 3 "Backing Up Volumes" .lp The Storage Manager does not support media recovery, .(x z volumes, backing up .)x \*($n so backing up critical data volumes is wise. A volume may be backed up when it is unmounted and needs no recovery. If a volume is stored on a Unix file, a simple copy of the file can be used as a backup. For volumes stored on a raw disk partition, the Unix \fIdd(1)\fR command can be used to back up the volume to a Unix file and to restore it. For example, to save a copy of the raw device \fC/dev/rrz4d\fR in the Unix file \fCbackup.rrz4d\fR, use: .(b \fCdd if=/dev/rrz4d of=backup.rrz4d\fR .)b To restore the backup, use: .(b \fCdd if=backup.rrz4d of=/dev/rrz4d\fR .)b .sp .sh 2 "Using the Server" .lp In this section we explain how to operate a Storage Manager server. For the purpose of this discussion, we use only one server, although any number of servers can be used to manage any number of volumes. We begin with starting and configuring the server. Next, we discuss what the server does during normal operation. We follow this with instructions for shutting the server down. Finally, we explain how the server recovers from failure. .sh 3 "Starting the Server" .lp The server is composed of two executable files: \fCsm_server\fR and \fCdiskrw\fR. .(x z disk process .)x \*($n \fCSm_server\fR is the main server program. \fCDiskrw\fR is started by the server, as a separate process for each mounted volume, for performing asynchronous disk I/O. These processes communicate with the server through sockets, semaphores, and shared memory. 
By default, the server assumes \fCdiskrw\fR is located in the user's path. .(x z default path for diskrw .)x \*($n An option, described below, can be used to change this assumption. .lp When the server is started, it first processes configuration options. .(x z configuration options .)x \*($n These options are discussed further below. Second, the server allocates the buffer pool. The buffer pool is located in shared memory, so the operating system must have shared-memory support. Furthermore, the machine on which the server runs must have enough shared memory to accommodate the entire buffer pool. If not enough shared memory is available, the server prints a message, indicating how much shared memory it is trying to acquire, and exits. .lp Third, the server mounts the log volume. .(x z log volume .)x \*($n .(x z regenerating log volume .)x \*($n .(x z log volume, regenerated .)x \*($n If the log volume is newly formatted, it is \fIregenerated\fR. When a log volume is regenerated, the entire log is cleared and written to disk. This will take noticeable time if the volume is large. If the log is not regenerated, recovery analysis is performed. .lp If no volumes require recovery, .(x z recovery .)x \*($n all phases of recovery complete in less than one second. If the analysis determines that any volumes require recovery (due to a previous failure of some sort: operating system failure, machine failure, internal error, or because a user killed the server), recovery is performed. Data volumes that were mounted at the time of the failure are remounted, updates by committed transactions are restored, and all transactions in progress at the time of failure are aborted. When recovery is complete, the data volumes are dismounted and a checkpoint is taken. .lp The server now begins to process requests from clients. 
.br .sh 3 "Configuring the Server" .lp There are several \fIconfiguration options\fR that .(x z configuration, options for server .)x \*($n can be set when the server is started. A brief description of the options is given in Table 3. Most options have default values, but some do not, and these \fImust\fR be given values, either on the command line or in a configuration file. See Section 3 for general information that applies to all options. .(z .(x z default, option values .)x \*($n .sz -2 .TS box, center, tab(#); c|c|c|c|c c|c|c|c|c l|l|l|l|l. Option#Option#Possible#Default#Option Name#Type#Values#Values#Description _ config#string#file name#/usr/lib/sm_config#read a configuration file ###$HOME/.sm_config#defaults are read unless ###./.sm_config#skipdefault is set verbose#Boolean#yes no#no#print configuration options bufpages#int#> 32#none#number of buffer pool pages logvolume#string#path name#none#name of the log volume portname#string#name or number#exodussm#port name or port number ####for a server; if a name, it ####must be in \fC/etc/services\fR errorfile#string#file name#- (stderr)#file for errors, ####warnings, progress regenlog#Boolean#yes no#no#clear the log shutdown#Boolean#yes no#no#shut down after recovery ####or regeneration of log checkpoints#int#> 1#100#checkpoint frequency ####(based on number of log pages) diskproc#string#file name#/usr/lib/exodus/diskrw#disk I/O program name intercache#Boolean#yes no#yes#allow caching of pages ####at the client between ####transactions progress#Boolean#yes no#no#control progress printing maxclients#int#> 0#20#maximum number of ####clients to be served ####simultaneously maxthreads#int#> 1#function(maxclients)#maximum number of ####threads. traceflags#int#hex number#0x0#set tracing flags. ####Available if server is ####compiled with -DDEBUG. tempformat#string###see Table 2. dataformat#string###see Table 2. logformat#string###see Table 2. 
maxaddvolumes#int#small number >= 0#0#increases volume table size wrapcount#int#>=0#0#starting wrap count for log .TE .sz +2 .ce .uh "Table 3: Server Options" .(x z options, server .)x \*($n .)z .lp Option values are read from the default configuration files \fC/usr/lib/sm_config\fR, \fC$HOME/.sm_config\fR, and \fC./.sm_config\fR in that order, if they exist. If the command-line option \*(lqskipdefault\*(rq is given, .(x z configuration files, skipping defaults .)x \*($n .(x z default configuration files, skipping .)x \*($n these default files are not read. .lp Options on the command line are read after the default files are read. Command-line options are prefixed by a \*(lq-\*(rq. In addition to options, a server accepts the command-line \fIflags\fR given in Table 4. Command-line flags are prefixed by a \*(lq-\*(rq. .(z .sz -2 .TS box, center, tab(#); c|c c|c l|l. Flag#Flag Name#Effect _ help#print a message and exit skipdefault#do not read default configuration files #must be the first argument on the command line force#do not confirm log regeneration option background#put in background (for use with Bourne shell) .TE .sz +2 .ce .uh "Table 4: Server Command-Line Flags" .(x z flags, server command-line .)x \*($n .)z .lp When given the \*(lqhelp\*(rq flag, a server prints a list of the available options and flags, and exits. .lp The \*(lqskipdefault\*(rq flag prevents a server from reading the default configuration files. It must be the first argument on the command line if it is used. .lp The \*(lqforce\*(rq flag prevents a server from checking with the user before regenerating the log. .lp The \*(lqbackground\*(rq flag causes the server to disconnect from its controlling terminal. This flag is available for users who run the server from shells that, like the Bourne shell, do not have real job control. .lp We now describe each option from Table 3. 
.lp The \*(lqconfig\*(rq option specifies a configuration file to read after default configuration files have been read. .(x z configuration file, which to read .)x \*($n This option is effective only on the command line. .lp The \*(lqverbose\*(rq option is used to turn on and off printing of the option values at startup. Options are printed to the file specified by the \*(lqerrorfile\*(rq option (q.v.). .lp The \*(lqbufpages\*(rq option indicates the number of MIN_PAGESIZE pages to be used for a server's buffer pool. The option must be given for a server to run. This option determines the size of the shared memory segment allocated by the server. The shared memory segment will be MIN_PAGESIZE*bufpages bytes long plus a few KB extra. See Section 5.3, \fBTuning the Server\fR, for more information about setting this option. .lp The \*(lqlogvolume\*(rq option gives the path name of the volume that contains the log. A value must be given for the log volume. .lp The \*(lqportname\*(rq option indicates a port number or the symbolic name of a port entry in \fC/etc/services\fR. The server connects to this port and listens for client requests on it. To enable clients to locate a server with a symbolic port name, the port name must be present in \fC/etc/services\fR on both the client and server machines. If no port name is given, a server looks for an entry \*(lqexodussm\*(rq, registered for use with TCP, in \fC/etc/services\fR. .lp Using port numbers instead of symbolic names avoids the need for entries in \fC/etc/services\fR. See the Unix manual page for services(5). An example entry for the default server name is: .(b \fCexodussm 1152/tcp # exodus storage manager\fR .)b .lp The \*(lqerrorfile\*(rq option directs server error messages and diagnostics to the given file. A value of \*(lq-\*(rq means that \fIstderr\fR is used. .lp The \*(lqregenlog\*(rq option causes the log on the log volume to be regenerated. 
\fBThis overwrites all log records, so it should not be done unless the server was last shut down cleanly\fR. Servers automatically regenerate their logs when they are started with a newly formatted log volume. When the option is set to \*(lqyes\*(rq, a confirmation is requested. The confirmation can be disabled by starting the server with the \*(lqforce\*(rq flag. .lp The \*(lqshutdown\*(rq option causes a server to shut down immediately after performing recovery or regenerating the log. .lp The \*(lqcheckpoints\*(rq option sets the checkpoint frequency for a server. The value represents the number of log pages written between checkpoints. .lp The \*(lqprogress\*(rq option causes a server to print messages tracing its progress. This is used for debugging; it slows the server. .lp The \*(lqdiskproc\*(rq option specifies the path name of the disk I/O program to be used by the server. .lp The \*(lqintercache\*(rq option allows experiments to be run with and without inter-transaction caching of pages on the client. .lp The \*(lqmaxclients\*(rq option determines the number of clients a server can serve at any one time. Servers create internal tables whose size depends on this value. .lp The \*(lqmaxthreads\*(rq value, determined by the \*(lqmaxclients\*(rq value, should be sufficient, but can be overridden. If a server recovers from a failure without running out of threads, it has enough threads to handle client requests. If numerous distributed transactions are active at the time .(x z transactions, distributed .)x \*($n .(x z distributed transactions .)x \*($n of a server failure, it is possible, but unlikely, that the server will not be able to recover with the default number of threads. .lp The \*(lqtraceflags\*(rq option is available only with a server that was compiled with debugging (the -DDEBUG flag). It is useful for programmers who are modifying the Storage Manager source code and testing their changes. 
.lp The \*(lqdataformat\*(rq, \*(lqlogformat\*(rq, and \*(lqtempformat\*(rq options are as described in Section 5.1.5, \fBFormatting Volumes\fR. Servers can mount and use volumes given in these options. .lp The \*(lqmaxaddvolumes\*(rq option indicates how large the mount table will be. The server reads its configuration files, counts the volumes named in the format options, and creates a mount table large enough to mount this many volumes and \*(lqmaxaddvolumes\*(rq more. This is a strict limit to the number of volumes that the server can mount (at any one time) as long as it is running. The value of \*(lqmaxaddvolumes\*(rq should not be boosted frivolously, because the size of the mount table affects the amount of shared memory required by the server. The default value is 0. .lp The \*(lqwrapcount\*(rq option is rarely needed. The server will tell you if you ever need to set this option. It is needed if you add volumes after the server starts (maxaddvolumes > 0), and a volume that you are adding was updated by a server running on a log that differs from the current log (or the log was regenerated since the added volume was last mounted). .sp .sh 3 "Normal Operation of Servers" .lp During normal operation, servers listen for connections and requests from clients and monitor terminal input. Error messages are printed on the servers' terminals when interesting events occur, for example, when a deadlock is detected, or a transaction is aborted by a server because of a problem such as insufficient log space. .sh 4 "Server Commands" .lp The following commands can be invoked from the standard input to the server: \*(lqhelp\*(rq, \*(lqshutdown\*(rq, \*(lqkill\*(rq, \*(lqcrash\*(rq, \*(lqcheckpoint\*(rq, \*(lqprintstats\*(rq, \*(lqclearstats\*(rq, \*(lqprogress\*(rq, \*(lquser\*(rq, \*(lqaddvolume\*(rq, \*(lqrmvolume\*(rq, \*(lqlistvolumes\*(rq, \*(lqlistmount\*(rq, \*(lqlistdistr\*(rq, \*(lqsource\*(rq, \*(lqredirect\*(rq. 
When the server is compiled with profiling (-DPROFIL, -p), the server accepts the \*(lqprofil\*(rq command. When the server is compiled with debugging (-DDEBUG), the server also accepts the \*(lqtraceflags\*(rq and \*(lqtracelevel\*(rq commands. .\" TODO: add tracelevel as regular option .lp The \*(lqhelp\*(rq command provides a list of the commands. .lp The \*(lqshutdown\*(rq command instructs the server to abort all active transactions and cleanly shut down. The \*(lqkill\*(rq command causes the server to halt immediately after displaying the status of mounted volumes. The \*(lqcrash\*(rq command has the same effect as the \*(lqkill\*(rq command, except that a core dump is produced as well. .lp The \*(lqcheckpoint\*(rq command causes the server to take a checkpoint immediately. Checkpoints are taken periodically by servers. The default frequency is once every 100 log pages, but this .(x z checkpoint .)x \*($n .(x z checkpoint frequency, default .)x \*($n .(x z default checkpoint frequency .)x \*($n can be changed by an application program (see sm_ChangeCheckpointFrequency(\ ) in Section 4.11.2, \fBAdministrative Operations\fR). .lp The \*(lqprintstats\*(rq command prints general server statistics. The \*(lqclearstats\*(rq command clears any counters among the statistics. .lp The \*(lqprogress\*(rq command reverses the value of the \*(lqprogress\*(rq option. .lp The \*(lquser\*(rq command reverses the value of an internal flag that determines whether or not the server prints a message when a user (application) error is encountered. (There is no option to control this.) .\" TODO: add a regular 'user' option .lp The \*(lqaddvolume\*(rq command adds a volume to the server's table of mountable volumes. The \*(lqaddvolume\*(rq command takes a format-option name and a format-option value. 
For example, to add the data volume 8000, type .(b \fCaddvolume dataformat /path/to/datafile:8000:1:1:300\fR .)b A volume cannot be added if the given format information conflicts with other information in the table. .lp The \*(lqrmvolume\*(rq command removes a volume from the server's table of mountable volumes. The command takes a volume identifier. For example, to remove the data volume 8000, type .(b \fCrmvolume 8000\fR .)b A volume cannot be removed if it is in use. .lp The \*(lqlistvolumes\*(rq command prints the server's table of mountable volumes. .lp The \*(lqlistmount\*(rq command prints a list of the volumes that are in some state of use: mounted, being mounted, or being dismounted. It also prints the number of free \*(lqmount slots\*(rq, which indicates how many more volumes could be mounted at any one time, given the server's configuration. To allow more volumes to be mounted at once, shut the server down, boost the value of the \*(lqmaxaddvolumes\*(rq option, and restart the server. .lp The \*(lqlistdistr\*(rq command prints information about prepared distributed transactions. .(x z transactions, distributed .)x \*($n .(x z distributed transactions .)x \*($n These transactions consume space in the log, and if they are not aborted or committed, eventually the server will fail because it will have run out of log space. .(x z log space .)x \*($n See Section 4.3, \fBTransactions\fR, and Section 4.11.1, \fBExternal Two-Phase Commit Functions\fR, for information about distributed transactions. .lp The \*(lqsource\*(rq command takes one argument, the path name of a file from which to read commands. The server processes these commands, and when it reads the last command in the file, it resumes reading from the terminal. If the path name is missing or is \fC/dev/tty\fR, reading resumes from the terminal. .lp The \*(lqredirect\*(rq command takes two arguments. 
The first argument indicates which output stream is to be redirected: messages to the terminal or error messages. The second argument is the path name of a file to which the output is written. When the output is redirected again, the stream is flushed to the given file and the file is closed. To redirect output to the terminal, use \fC/dev/tty\fR or omit the path name. .lp The \*(lqprofil\*(rq command causes the server to dump its profiling information to disk. This command is available only on a server that was compiled with profiling on (-DPROFIL -p). See the manual page for prof(1). .lp The \*(lqtraceflags\*(rq command may take an integer argument, which may be a hexadecimal number, such as \*(lq0xfa3\*(rq, in which case it sets the server's trace flags word to that value. The command is available only with a server that was compiled with debugging on (-DDEBUG -g). The meanings of the trace flags are found in the server's source code, in \fCsrc/include/global_trace.h\fR. When \*(lqtraceflags\*(rq is used with no argument, it prints the value of the trace flags word. .lp The \*(lqtracelevel\*(rq command is available only with a server that was compiled with debugging on (-DDEBUG -g). When used with no argument, it prints the trace level for the trace flags that are on. When given an integer argument (1, 2, or 3), it sets the trace level for the trace flags that are on. .br .sh 3 "Shutting Down the Server" .lp The server can be shut down in several ways. One method is to use one of the above-mentioned commands. Another is to run the \*(lqshutserver\*(rq program, described at the end of this section. A third way to shut down a server is to call sm_ShutdownServer(\ ) in a client program. .lp A server may also shut itself down because of a fatal error, such as the unexpected death of a disk process or a bug. A fatal error causes the server to report the state of all the mounted volumes, dump core, and exit. 
.lp The server allocates a Unix System V shared-memory segment and a semaphore set when it starts. If a server is shut down in a controlled fashion, it removes the segment and semaphore set. These resources are not removed when the server is terminated by \fCkill -9 <server process>\fR typed in the shell, by the \*(lqkill\*(rq or \*(lqcrash\*(rq command given to the server's terminal monitor, or when the server process is killed by a debugger. \fBIf you use any one of these means to terminate a server, you must use ipcrm(1) to remove the resources.\fR See the manual pages for ipcs(1) and ipcrm(1) for more information. If the segments and semaphore sets are not removed, eventually the operating system will run out of segments, and you will be unable to start a new server. .lp If a server shuts down without having committed or aborted all its active transactions and flushed all its dirty pages to disk, recovery is required when the server is restarted. When a server shuts down, it prints the status of all the mounted volumes. It indicates if recovery is necessary on those volumes. .br .sh 4 "Running the Shutserver Program" .lp The \fCshutserver\fR program is invoked as follows: .(b \fCshutserver [-m machine] [-s servername] [-h]\fR .)b The \*(lqmachine\*(rq specifies the name of the machine running the server to be shut down. If \*(lq-m machine\*(rq is not given, the program uses the machine on which \fCshutserver\fR is executed. The \*(lqservername\*(rq is the name of the server in \fC/etc/services\fR. .\" TODO: this should be -port name. If \*(lq-s servername\*(rq is not given, \*(lqexodussm\*(rq is used. The \*(lq-h\*(rq option prints a brief help message. .br .sh 3 "Recovery" .lp When a server is started after a failure, it automatically performs recovery. 
The time it takes for recovery depends on several factors, including the number of transactions in progress at the time of the failure, the number of log records generated by these transactions, and the number of log records generated since the last checkpoint. .lp Recovery has three phases. .(x z recovery .)x \*($n After each phase, the server prints information about the time and I/O operations required to perform the phase. .lp The first phase is \fIanalysis\fR. The log is scanned to determine what transactions were active and which volumes were mounted at the time of the failure. .lp After analysis, the volumes are mounted and the \fIredo\fR phase is performed. In the redo phase, data are restored to their state at the time of the failure. .lp In the last phase, the \fIundo\fR phase, the server aborts the transactions that were active at the time of the crash. The volumes are dismounted, and a checkpoint is taken. .lp For details of recovery in the Storage Manager, see [Fran92]. .br .sh 2 "Tuning the Server" .lp There are several tuning parameters in the Storage Manager server. The following sections describe each one. .br .sh 4 "The Size of the Buffer Pool" .lp The size of a server's buffer pool is determined by the \*(lqbufpages\*(rq option, which indicates the number of MIN_PAGESIZE pages in the buffer pool. If a server is the primary process on a machine, it should have a buffer pool close to the size of available shared memory. When both an application and a server are running on the same machine, choosing a buffer pool size is more difficult. A \*(lqproper\*(rq choice depends on the behavior of the applications and their interactions with servers. A good rule of thumb is that clients should have adequate buffer space, to minimize client-server interaction. .lp The buffer pool must fit in the available shared memory of the machine on which the server runs. The server will let you know if it cannot acquire enough shared memory when it starts. 
See the manual pages for ipcs(1) and ipcrm(1) to find out how much shared memory is in use. If you cannot run a server with a buffer pool of adequate size even though no shared-memory segments are being wasted, ask your system administrator how much shared memory has been configured for your system.
.br
.sh 4 "The Size of Log Pages"
.lp
The log page size is determined when a log volume is formatted. For a transaction mix dominated by transactions that generate more than a few kilobytes of log information, the larger the log page size, the better. For short-running transactions, such as those found in transaction processing benchmarks, 8 Kbyte log pages give good results.
.br
.sh 4 "Checkpoint Frequency"
.lp
The checkpoint frequency is based on the number of log pages written. The default is one checkpoint every 100 log pages. The frequency can be set with the \*(lqcheckpoint\*(rq configuration option.
.(x z
checkpoint frequency
.)x \*($n
.(x z
default checkpoint frequency
.)x \*($n
It can be changed in a running server by an application that calls sm_ChangeCheckpointFrequency(\ ). More frequent checkpoints tend to shorten the time required to recover after a server fails, at the expense of processing time during normal operation. Checkpoints also cause the server's dirty pages to be flushed to disk, which may improve performance during normal operation.
.bp
.sh 1 "REFERENCES"
.sp
.ip "[Care86]" 10
M. Carey, D. DeWitt, J. Richardson, and E. Shekita, \fIObject and File Management in the EXODUS Extensible Database System\fR, \fBProc. of the 1986 VLDB Conf.\fR, Kyoto, Japan, Aug. 1986.
.ip "[Care89]" 10
M. Carey, D. DeWitt, and E. Shekita, \fIStorage Management for Objects in EXODUS\fR, \fBObject-Oriented Concepts, Databases, and Applications\fR, W. Kim and F. Lochovsky, eds., Addison-Wesley, 1989.
.ip "[Chou85]" 10
H. Chou and D. DeWitt, \fIAn Evaluation of Buffer Management Strategies for Relational Database Systems\fR, \fBProc.
of the 1985 VLDB Conf.\fR, Stockholm, Sweden, Aug. 1985.
.ip "[Fran92]" 10
M. Franklin, M. Zwilling, C. K. Tan, M. Carey, and D. DeWitt, \fICrash Recovery in Client-Server EXODUS\fR, \fBProc. of the ACM SIGMOD Int'l. Conf. on Management of Data\fR, San Diego, CA, June 1992.
.ip "[Gray78]" 10
J. N. Gray, \fINotes on Database Operating Systems\fR, \fBLecture Notes in Computer Science 60, Advanced Course on Operating Systems\fR, ed. G. Seegmuller, Springer-Verlag, New York, 1978.
.ip "[Gray88]" 10
J. Gray, R. Lorie, G. Putzolu, and I. Traiger, \fIGranularity of Locks and Degrees of Consistency in a Shared Data Base\fR, \fBReadings in Database Systems\fR, ed. M. Stonebraker, Morgan Kaufmann, San Mateo, CA, 1988.
.ip "[Litw88]" 10
W. Litwin, \fILinear Hashing: A New Tool for File and Table Addressing\fR, \fBReadings in Database Systems\fR, ed. M. Stonebraker, Morgan Kaufmann, San Mateo, CA, 1988.
.ip "[Moha83]" 10
C. Mohan and B. Lindsay, \fIEfficient Commit Protocols for the Tree of Processes Model of Distributed Transactions\fR, \fBProc. 2nd ACM SIGACT/SIGOPS Symposium on Principles of Distributed Computing\fR, Montreal, Canada, Aug. 1983.
.ip "[Moha89]" 10
C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz, \fIARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging\fR, \fBACM Transactions on Database Systems\fR, Vol. 17, No. 1, March 1992.
.ip "[Rich87]" 10
J. Richardson and M. Carey, \fIProgramming Constructs for Database System Implementation in EXODUS\fR, \fBProc. of the ACM SIGMOD Int'l. Conf. on Management of Data\fR, San Francisco, CA, May 1987.
.ip "[exoArch]" 10
\fIEXODUS Storage Manager Architecture Overview\fR, unpublished, included in EXODUS Storage Manager software release.
.bp
.\" use alphabetic section header A.1, A.2, etc.
.af $1 A
.nr $1 0
.af $9 A
.nr $9 1
.sh 1 "APPENDIX : Locking Protocol for Storage Manager Operations"
.lp
The Storage Manager performs concurrency control using the standard hierarchical two-phase locking protocol (see [Gray78], [Gray88])
.(x z
locking protocol
.)x \*($n Appendix
.(x z
two-phase locking protocol
.)x \*($x Appendix
for locking files and object pages. The lock hierarchy contains two granularities: file-level and page-level. Locking for index operations is performed with a non-two-phase protocol that allows multiple clients to read and update the same index. This section describes the lock modes used in the system, lists the locks requested for each Storage Manager file and object operation, and explains how deadlocks are handled.
.(x z
deadlock
.)x \*($n Appendix
Lock acquisition and release are \fIimplicit\fR in all relevant operations, so clients cannot explicitly manage their own locks.
.br
.sh 2 "Lock Modes"
.lp
Files are locked in one of six modes: no lock (NL), shared (S), exclusive (X), intent to share (IS), intent to exclusive (IX),
.(x z
lock, exclusive
.)x \*($n Appendix
and share with intent to exclusive (SIX) [Gray78], [Gray88]. Only shared and exclusive locks are obtained on pages. Whether two locks are compatible (e.g., when one client holds a lock on a file and another client wants to obtain a lock on it as well) can be determined from a table. Table \n($9.1 is a lock compatibility table for the six file lock modes. Each row indicates a lock that some client holds, and each column indicates a lock requested by another client. The Y and N table entries indicate whether or not the two locks are compatible.
.\" ) to match open paren in Table reference above
.(z
.TS
center, tab(#), box ;
c s s s s s s
c|c s s s s s
c|c c c c c c
l|l l l l l l.
Lock#Lock Requested
Held#NL#IS#IX#S#SIX#X
_
NL#Y#Y#Y#Y#Y#Y
IS#Y#Y#Y#Y#Y#N
IX#Y#Y#Y#N#N#N
S#Y#Y#N#Y#N#N
SIX#Y#Y#N#N#N#N
X#Y#N#N#N#N#N
.TE
.ce
.uh "Table \n($9.1: Lock Compatibility"
.\" ) to match open paren in .uh above
.)z
.lp
Another table can be used to express lock convertibility. A lock conversion occurs when a client holds a lock in some mode and requests an operation that requires a different mode for the lock. Table \n($9.2 is a lock convertibility table for the six file lock modes. Each row indicates a lock that the client already holds and each column indicates the new lock mode requested. The entries represent the resulting lock mode obtained.
.\" ) to match open paren Table ref above
.(z
.TS
center, tab(#), box ;
c s s s s s s
c|c s s s s s
c|c c c c c c
l|l l l l l l.
Lock#Lock Requested
Held#NL#IS#IX#S#SIX#X
_
NL#NL#IS#IX#S#SIX#X
IS#IS#IS#IX#S#SIX#X
IX#IX#IX#IX#SIX#SIX#X
S#S#S#SIX#S#SIX#X
SIX#SIX#SIX#SIX#SIX#SIX#X
X#X#X#X#X#X#X
.TE
.ce
.uh "Table \n($9.2: Lock Convertibility"
.\" ) to match open paren in .uh above
.)z
.sp
.br
.sh 2 "Locks Obtained by Operations"
.lp
The locks mentioned above are obtained on two types of structures in the Storage Manager: files and pages. Only the pages that contain object headers and root entries are locked; large object data pages and file index pages are not locked. The entire root entry page is locked when a root entry is used.
.lp
Table \n($9.3 lists all of the locks
.\" ) to match open paren above
obtained by the various Storage Manager operations. The column labelled \*(lqFile Lock\*(rq indicates what lock mode is used for locking the file in question. The column labelled \*(lqPage Lock\*(rq indicates what lock mode is used for locking pages containing the objects or root entries in question. Locks are held until the end of the transaction in which they were acquired.
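.lp
The lock compatibility and convertibility tables in the Lock Modes section can be read as two small lookup arrays. The following C fragment is a minimal sketch of that reading. The names LockMode, LOCK_NL through LOCK_X, locks_compatible and lock_convert are chosen here for illustration and are assumptions, not part of the Storage Manager interface. The fragment is not a description of how the server implements its lock manager.

```c
/* File lock modes, in the row and column order used by the two tables.
 * All identifiers here are illustrative, not Storage Manager names. */
typedef enum {
    LOCK_NL, LOCK_IS, LOCK_IX, LOCK_S, LOCK_SIX, LOCK_X, LOCK_NUM_MODES
} LockMode;

/* compatible[held][requested]: 1 for a Y entry, 0 for an N entry. */
static const int compatible[LOCK_NUM_MODES][LOCK_NUM_MODES] = {
    /* held NL  */ {1, 1, 1, 1, 1, 1},
    /* held IS  */ {1, 1, 1, 1, 1, 0},
    /* held IX  */ {1, 1, 1, 0, 0, 0},
    /* held S   */ {1, 1, 0, 1, 0, 0},
    /* held SIX */ {1, 1, 0, 0, 0, 0},
    /* held X   */ {1, 0, 0, 0, 0, 0}
};

/* convert_to[held][requested]: mode held after a conversion request. */
static const LockMode convert_to[LOCK_NUM_MODES][LOCK_NUM_MODES] = {
    /* held NL  */ {LOCK_NL,  LOCK_IS,  LOCK_IX,  LOCK_S,   LOCK_SIX, LOCK_X},
    /* held IS  */ {LOCK_IS,  LOCK_IS,  LOCK_IX,  LOCK_S,   LOCK_SIX, LOCK_X},
    /* held IX  */ {LOCK_IX,  LOCK_IX,  LOCK_IX,  LOCK_SIX, LOCK_SIX, LOCK_X},
    /* held S   */ {LOCK_S,   LOCK_S,   LOCK_SIX, LOCK_S,   LOCK_SIX, LOCK_X},
    /* held SIX */ {LOCK_SIX, LOCK_SIX, LOCK_SIX, LOCK_SIX, LOCK_SIX, LOCK_X},
    /* held X   */ {LOCK_X,   LOCK_X,   LOCK_X,   LOCK_X,   LOCK_X,   LOCK_X}
};

/* Nonzero if one client may be granted `requested` while another
 * client holds `held` on the same file. */
int locks_compatible(LockMode held, LockMode requested)
{
    return compatible[held][requested];
}

/* Mode a client ends up holding when it already holds `held`
 * and then requests `requested` on the same file. */
LockMode lock_convert(LockMode held, LockMode requested)
{
    return convert_to[held][requested];
}
```

With these arrays, locks_compatible(LOCK_IX, LOCK_S) is 0, matching the N entry in row IX, column S of the compatibility table, and lock_convert(LOCK_S, LOCK_IX) is LOCK_SIX, matching row S, column IX of the convertibility table.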
.lp
Some applications may find it necessary to acquire more restrictive locks on pages and files to avoid conflicts during lock-upgrade requests. For example, consider an application that reads an object (with sm_ReadObject(\ )) and subsequently writes it (with sm_WriteObject(\ )). When the object is read, a shared lock is acquired for the object's page.
.(x z
lock, share
.)x \*($n Appendix
When the object is written, a lock-upgrade request is sent to the server to obtain an exclusive lock on the page.
.(x z
lock, exclusive
.)x \*($n Appendix
This extra message is relatively expensive and can lead to deadlock if other clients are locking the page as well.
.(x z
deadlock
.)x \*($n Appendix
To avoid this problem, the \*(lqpagelock\*(rq option can be used to change the default lock modes used when
.(x z
default lock mode
.)x \*($n Appendix
the client library locks a page. See Table 1 and the discussion of client options in Section 4.2, \fBInitialization and Shutdown Operations\fR, for information about setting client options. See Appendix A for more information about lock modes and the Storage Manager's locking protocols.
.(z
.TS
box, center, tab(#) ;
c s s s
c c c c
l l l l.
Operation#File Lock#Page Lock#Comments
_
sm_Initialize(\ )#-#-#no locks needed
sm_ShutDown(\ )#-#-#no locks needed
sm_OpenBufferGroup(\ )#-#-#no locks needed
sm_CloseBufferGroup(\ )#-#-#no locks needed
sm_SetRootEntry(\ )#-#X#root entry page
sm_GetRootEntry(\ )#-#S#root entry page
sm_RemoveRootEntry(\ )#-#X#root entry page
sm_CreateFile(\ )#X#-#
sm_DestroyFile(\ )#X#-#
sm_GetFirstOid(\ )#S#-#
sm_GetLastOid(\ )#S#-#
sm_GetNextOid(\ )#S#-#
sm_GetPreviousOid(\ )#S#-#
sm_OpenScan(\ )#S#-#
sm_OpenScanWithGroup(\ )#S#-#
sm_ScanNextObject(\ )#-#-#no locks needed
sm_CloseScan(\ )#-#-#no locks needed
sm_OpenLoad(\ )#X#-#
sm_LoadNextObject(\ )#-#-#no locks needed
sm_CloseLoad(\ )#-#-#no locks needed
sm_CreateObject(\ )#IX#X#unordered file
sm_DestroyObject(\ )#IX#X#
sm_ReadObject(\ )#IS#S#
sm_ReadObjectHeader(\ )#IS#S#
sm_ReleaseObject(\ )#-#-#no locks needed
sm_WriteObject(\ )#IX#X#
sm_InsertInObject(\ )#IX#X#
sm_AppendToObject(\ )#IX#X#
sm_DeleteFromObject(\ )#IX#X#
sm_CreateVersion(\ )#IX#X#
sm_FreezeVersion(\ )#IX#X#
.TE
.ce
.uh "Table \n($9.3: Locks Obtained by Operations"
.\" ) to match open paren above
.)z
.(x z
locks obtained by functions
.)x \*($n Table
.sp
.br
.sh 2 "Deadlock Detection and Avoidance"
.lp
With each lock request, a server analyzes its local waits-for graph and detects local cycles, or \*(lqlocal deadlocks\*(rq.
.(x z
deadlock avoidance
.)x \*($n Appendix
.(x z
deadlock detection
.)x \*($n Appendix
The request that would cause a deadlock is denied (returns esmFAILURE), and the client library returns esmLOCKCAUSEDDEADLOCK to the application in the global variable sm_errno.
.lp
Distributed transactions may also cause a deadlock. The servers do not detect deadlocks that involve other servers. Global deadlocks are avoided by timing out locks. Each request that awaits a lock is aged.
When its age exceeds the time given by the client's \*(lqlocktimeout\*(rq option, the request is denied (returns esmFAILURE), and the client library returns esmLOCKBUSY to the application in the global variable sm_errno.
.lp
When an application's request fails with esmLOCKBUSY or esmLOCKCAUSEDDEADLOCK, the application must abort its transaction to free the locks it holds, and then start the transaction again.
.sp 3
.bp
.sh 1 "APPENDIX : Generation of Unique Numbers for OIDs"
.lp
The \*(lqunique\*(rq field of an OID is a special 32-bit value that is generated when the object is created and is used to detect cases in which the OID has become dangling or corrupted. The values that are stored in \*(lqunique\*(rq fields are generated by Storage Manager servers. Disk volumes are partitioned into blocks of 32 pages, and for each partition a 32-bit counter is maintained. When a new page is allocated, it is allotted a range of 100 unique numbers to use during object creation. The counter in the partition containing the new page is incremented to reflect the allotment. When this allotment has been exhausted, a request is made to the server for another allotment. When an object is created in a particular partition, the \*(lqunique\*(rq field of the new object's OID is set to the next available number in the range on the page. While this strategy does not guarantee that OIDs are unique for all time, the probability that a dangling OID maps to the same page and the same slot, and has the same \*(lqunique\*(rq field, as a valid OID is very low. As a result, \*(lqunique\*(rq fields can be used to virtually guarantee the validity of an OID. We adopted this approach instead of using unique-for-all-time logical OIDs with a surrogate index in order to avoid the extra disk I/Os that might be needed to translate a logical OID to a physical address.
.bp
.++P
.sp 0.5i
.ce 1
\fBTABLE OF CONTENTS\fR
.sp 2
.xp
.\" .bp
.\" .++P
.\" .sp 0.5i
.\" .ce 1
.\" \fBINDEX\fR
.\" .sp 2
.\" .xp z